Versioning Datasets
Snapshot an evolving training set as commits, and let dedup keep unchanged shards from being stored twice.
Training data is rarely static. You add a batch of samples, fix some labels, drop a bad shard, and suddenly you have data-v1/, data-v2-final/, and data-v2-final-REAL/ sitting side by side, each a near-complete copy of the last. Versioning the directory with dits gives you one history and stores only what actually changed.
If you have not set up the engine yet, see Getting Started. The dedup behavior below follows directly from how Chunking works.
Snapshot a version
Initialize once, then commit each revision of the dataset. The message is your changelog — record what was added, removed, or relabeled.
cd datasets/instruct-mix
dits init
dits add data/
dits commit -m "v1: 1.2M samples, 8 shards"
# Later, after appending a new batch:
dits add data/
dits commit -m "v2: + 40k samples, fixed 300 labels"
Compare versions
Use dits log for the version timeline and dits diff to see which shards changed between two revisions — useful when you need to know exactly what entered a given training run.
dits log
dits diff <v1-commit> <v2-commit>
What dedupes well today
Additive changes are the sweet spot. When v2 appends new shards and leaves the existing ones byte-for-byte identical, those shared shards are stored once and referenced by both commits — v2 only costs you the genuinely new data. The same holds across shards that repeat identical content.
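To make the storage cost of an additive change concrete, here is a minimal sketch of a content-addressed store. It is not how dits is implemented: the `ShardStore` class, the use of SHA-256 over whole shards, and the shard names are all assumptions made for illustration, and the engine's real chunking may split files differently (see Chunking).

```python
import hashlib


class ShardStore:
    """Toy content-addressed store (hypothetical, not the dits engine)."""

    def __init__(self):
        self.blobs = {}    # digest -> shard bytes, kept once per unique digest
        self.commits = {}  # commit name -> {shard path: digest}

    def commit(self, name, shards):
        manifest = {}
        for path, data in shards.items():
            digest = hashlib.sha256(data).hexdigest()
            # setdefault leaves an already-stored blob untouched.
            self.blobs.setdefault(digest, data)
            manifest[path] = digest
        self.commits[name] = manifest

    def stored_bytes(self):
        return sum(len(data) for data in self.blobs.values())


# Hypothetical shard contents standing in for real files on disk.
v1 = {
    f"data/shard-{i:03d}.jsonl": f"samples for shard {i}\n".encode() * 100
    for i in range(8)
}
v2 = dict(v1)  # the existing shards are byte-for-byte identical
v2["data/shard-008.jsonl"] = b"newly appended batch\n" * 100

store = ShardStore()
store.commit("v1", v1)
after_v1 = store.stored_bytes()
store.commit("v2", v2)
after_v2 = store.stored_bytes()

new_shard_size = len(v2["data/shard-008.jsonl"])
print("extra bytes stored for v2:", after_v2 - after_v1)
print("size of the new shard:    ", new_shard_size)
```

Under these assumptions the two printed numbers match: both commit manifests point at the same eight digests, and only the ninth shard adds to the store.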
What does not dedupe yet
Exact-match chunking treats a near-duplicate as a brand-new file. Two augmentations of the same image, or a sample with one token changed, share almost everything semantically but differ in bytes — so today they are stored in full, twice.
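The effect of a one-byte difference can be sketched with an ordinary cryptographic hash. This is an illustration under assumptions, not the engine's actual code: SHA-256 stands in for whatever digest the engine uses, and the sample text is made up.

```python
import hashlib

# Two hypothetical samples that differ in exactly one byte.
original = b"the quick brown fox jumps over the lazy dog\n"
edited = original.replace(b"lazy", b"hazy")

# Count positions where the two byte strings agree.
matching = sum(a == b for a, b in zip(original, edited))

digest_original = hashlib.sha256(original).hexdigest()
digest_edited = hashlib.sha256(edited).hexdigest()

print(f"{matching} of {len(original)} bytes identical")
print("digests equal:", digest_original == digest_edited)

# Exact-match dedup keys on the digest, so both copies are kept in full.
stored = {digest_original: original, digest_edited: edited}
print("bytes stored:", sum(len(blob) for blob in stored.values()))
```

Because the digests share nothing even though the inputs are nearly identical, an exact-match store has no way to notice the overlap and keeps both copies.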
dits push is on the roadmap and not available yet. For now, dataset history lives in the dataset directory.
Related
- Versioning Checkpoints
- Fine-Tunes & Variants
- CLI Reference (shared engine)