Versioning Datasets

Snapshot an evolving training set as commits, and let dedup keep unchanged shards from being stored twice.

Training data is rarely static. You add a batch of samples, fix some labels, drop a bad shard, and suddenly you have data-v1/, data-v2-final/, and data-v2-final-REAL/ sitting side by side, each a near-complete copy of the last. Versioning the directory with dits gives you one history and stores only what actually changed.

If you have not set up the engine yet, see Getting Started. The dedup behavior below follows directly from how Chunking works.

Snapshot a version

Initialize once, then commit each revision of the dataset. The message is your changelog — record what was added, removed, or relabeled.

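# One-time setup inside the dataset directory:
cd datasets/instruct-mix
dits init

# Stage the data directory and record the first snapshot:
dits add data/
dits commit -m "v1: 1.2M samples, 8 shards"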

# Later, after appending a new batch:
dits add data/
dits commit -m "v2: + 40k samples, fixed 300 labels"

Compare versions

Use dits log for the version timeline and dits diff to see which shards changed between two revisions — useful when you need to know exactly what entered a given training run.

dits log
dits diff <v1-commit> <v2-commit>
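
To make "which shards changed" concrete, here is a minimal Python sketch that compares two snapshots by content hash. It does not call dits or parse its output; `build_manifest` and `diff_manifests` are hypothetical helper names, and SHA-256 over whole files is an assumption chosen for illustration, not a statement about how dits diff works internally.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MiB blocks."""
    hasher = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(block)
    return hasher.hexdigest()


def build_manifest(data_dir):
    """Map each file's path (relative to data_dir) to its content digest."""
    root = Path(data_dir)
    return {
        path.relative_to(root).as_posix(): file_digest(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def diff_manifests(old, new):
    """Classify shards as added, removed, or changed between two manifests."""
    shared = set(old) & set(new)
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "changed": sorted(name for name in shared if old[name] != new[name]),
    }
```

One way to use it: save the output of `build_manifest("data")` next to each training run, then call `diff_manifests` on two saved manifests to list exactly which shards differ.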

What dedupes well today

Additive changes are the sweet spot. When v2 appends new shards and leaves the existing ones byte-for-byte identical, those shared shards are stored once and referenced by both commits — v2 only costs you the genuinely new data. The same holds across shards that repeat identical content.
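
The storage arithmetic can be sketched with a toy content-addressed store. This is an illustration under stated assumptions, not the dits implementation: `ContentStore` is a hypothetical class, it hashes whole shards with SHA-256 rather than chunks, and the shard names and sizes are made up.

```python
import hashlib


class ContentStore:
    """Toy content-addressed store: identical bytes are kept only once."""

    def __init__(self):
        self.blobs = {}  # SHA-256 hex digest -> stored bytes

    def commit(self, shards):
        """Store a {name: bytes} snapshot.

        Returns the snapshot as {name: digest} plus the number of bytes
        that had to be newly written for this commit.
        """
        snapshot, new_bytes = {}, 0
        for name, data in shards.items():
            digest = hashlib.sha256(data).hexdigest()
            if digest not in self.blobs:
                self.blobs[digest] = data
                new_bytes += len(data)
            snapshot[name] = digest
        return snapshot, new_bytes


store = ContentStore()
v1 = {"shard-000.jsonl": b"a" * 1000, "shard-001.jsonl": b"b" * 1000}
v2 = {**v1, "shard-002.jsonl": b"c" * 400}  # v2 appends one new shard

_, v1_cost = store.commit(v1)
_, v2_cost = store.commit(v2)
print(v1_cost, v2_cost)  # 2000 400
```

Both snapshots reference the two original shards, but the second commit writes only the 400 bytes of the appended shard.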

Tip

Keep shard files stable. If your pipeline rewrites or re-shuffles every shard on each export, the bytes change even when the samples don't, and dedup can't see the overlap. Append new shards rather than rewriting old ones.
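
A minimal sketch of an append-only export step, assuming shards are JSONL files named `shard-00000.jsonl`, `shard-00001.jsonl`, and so on. The function name `append_shard` and the naming scheme are hypothetical; adapt them to your own pipeline.

```python
from pathlib import Path


def append_shard(data_dir, lines, prefix="shard-", suffix=".jsonl"):
    """Write `lines` to the next numbered shard without touching existing ones."""
    root = Path(data_dir)
    root.mkdir(parents=True, exist_ok=True)

    # Use the highest existing index + 1, so a dropped shard never causes reuse.
    indices = [
        int(path.name[len(prefix):-len(suffix)])
        for path in root.glob(f"{prefix}*{suffix}")
    ]
    target = root / f"{prefix}{max(indices, default=-1) + 1:05d}{suffix}"

    # Mode "x" fails if the file already exists, so old shards cannot be
    # overwritten. newline="\n" keeps the bytes identical across platforms.
    with open(target, "x", encoding="utf-8", newline="\n") as handle:
        for line in lines:
            handle.write(line + "\n")
    return target
```

Because earlier shards are never reopened for writing, their bytes stay identical from one export to the next, which is the property exact-match dedup relies on.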

What does not dedupe yet

Exact-match chunking treats a near-duplicate as a brand-new file. Two augmentations of the same image, or a sample with one token changed, share almost everything semantically but differ in bytes — so today they are stored in full, twice.
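
The gap between "almost the same" and "byte-identical" can be shown in a few lines of Python. This sketch assumes SHA-256 over a whole record stands in for exact-match comparison; the two sample records and the helper names `is_exact_duplicate` and `similarity` are made up for illustration.

```python
import difflib
import hashlib

original = '{"text": "The cat sat on the mat.", "label": "neutral"}'
edited = '{"text": "The cat sat on a mat.", "label": "neutral"}'


def is_exact_duplicate(a, b):
    """True only when the two records are byte-for-byte identical."""
    digest_a = hashlib.sha256(a.encode("utf-8")).hexdigest()
    digest_b = hashlib.sha256(b.encode("utf-8")).hexdigest()
    return digest_a == digest_b


def similarity(a, b):
    """Rough character-level similarity in [0, 1] from difflib."""
    return difflib.SequenceMatcher(None, a, b).ratio()


print(is_exact_duplicate(original, edited))  # False
print(similarity(original, edited) > 0.9)    # True
```

The two records overlap almost entirely at the character level, yet an exact-match comparison sees two unrelated items and keeps both.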

Warning

Near-duplicate samples (augmentations, lightly edited records) do not dedupe with exact-match chunking. Similarity-based dedup that collapses near-duplicates is on the roadmap — see Similarity Dedup.

Note

Pushing a dataset repository to shared storage via dits push is on the roadmap and not available yet. For now, a dataset's history lives only in its own directory.
