Versioning Checkpoints

Replace ckpt-step sprawl with a real, navigable history of every checkpoint your run produces.

A long training run leaves behind dozens of weight files (ckpt-step-1000.safetensors, ckpt-step-1500.safetensors, and so on) piled into a single output directory. You lose track of which one was best, what changed between them, and why you kept any of them. Treating the run's output directory as a dits repository turns that pile into a commit history you can navigate.

New to the engine? Start with Getting Started, and skim Chunking to understand how large files are stored as deduplicated content chunks.

Initialize the run directory

Run dits init wherever your trainer writes checkpoints. From then on, each checkpoint becomes a commit instead of a new filename.

cd runs/llama-sft-2026-06
dits init

# After your trainer saves the first checkpoint:
dits add model.safetensors
dits status
dits commit -m "step 1000: lr 2e-5, loss 1.84"

Commit each checkpoint

Keep the filename stable (for example always model.safetensors) and let commits carry the version. Repeat after every save so the message records the step, hyperparameters, and metrics that matter.

dits add model.safetensors
dits commit -m "step 1500: loss 1.71"

dits add model.safetensors
dits commit -m "step 2000: loss 1.63 (best so far)"
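
Repeating the add/commit pair by hand invites inconsistent messages. The sketch below wraps the two commands shown above in a small shell function so every commit message has the same "step N: metrics" shape. The names ckpt_message, commit_checkpoint, and the DITS variable are hypothetical helpers for illustration, not part of dits; only the dits add and dits commit -m invocations come from this page.

```shell
# Hypothetical helpers (not part of dits).

# Build a commit message: $1 = step number, $2 = free-form metrics string.
ckpt_message() {
  printf 'step %s: %s\n' "$1" "$2"
}

# Stage and commit the stable checkpoint file.
# DITS defaults to the real binary; set DITS=echo for a dry run
# that only prints the arguments that would be passed.
commit_checkpoint() {
  "${DITS:-dits}" add model.safetensors &&
  "${DITS:-dits}" commit -m "$(ckpt_message "$1" "$2")"
}

# Dry run: prints the two argument lists without touching a repository.
DITS=echo commit_checkpoint 1500 "loss 1.71"
```

Calling commit_checkpoint 1500 "loss 1.71" without the DITS override from your trainer's save hook would run the real commands, assuming dits is on your PATH.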

Inspect history

Use dits log to see the full checkpoint timeline, and dits diff to see which underlying chunks changed between two checkpoints — a quick signal for how much the weights moved.

dits log
dits diff <prev-commit> <this-commit>
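
If every commit message follows the "step N: ... loss X" pattern used above, the best checkpoint can be picked out of the timeline mechanically. The sketch below assumes you already have the commit messages as plain text, one per line, on standard input; how to extract them from dits log output is not covered here, and best_step is a hypothetical helper rather than a dits command.

```shell
# Hypothetical helper: read commit messages on stdin and print the one
# with the lowest value after the word "loss".
best_step() {
  awk '
    match($0, /loss [0-9.]+/) {
      # "loss " is 5 characters; the number follows it.
      loss = substr($0, RSTART + 5, RLENGTH - 5) + 0
      if (!found || loss < best) { best = loss; line = $0; found = 1 }
    }
    END { if (found) print line }
  '
}

# Example with the three messages from this page:
printf '%s\n' \
  "step 1000: lr 2e-5, loss 1.84" \
  "step 1500: loss 1.71" \
  "step 2000: loss 1.63 (best so far)" | best_step
# prints: step 2000: loss 1.63 (best so far)
```

Lines without a "loss" value are skipped, so free-form messages in the history do not break the search.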

Restore a past checkpoint

When a later step overfits, roll back to the best one. Check out the commit and your model.safetensors is restored to that exact state — ready to resume or export.

# Restore the working tree to a specific checkpoint
dits checkout <commit>

# Optionally, branch off that checkpoint to explore a different schedule
dits branch resume-from-2000
