Why Dits for AI

Checkpoints and datasets deserve real version control — not a folder full of renamed copies.

The problem

Most teams “version” their model artifacts by renaming files in an object store: model-v1.safetensors, model-final.safetensors, model-final-actually.safetensors. Each of those is a full copy, even when 99% of its bytes are identical to the previous one. A 7B-parameter checkpoint in 16-bit precision is about 14 GB, so saving it ten times costs roughly 140 GB of mostly redundant storage, and there is no way to ask what actually changed between two of them, or where a given artifact came from.

Buckets store opaque blobs. Git LFS does the same thing with extra ceremony — it tracks pointers to whole files and ships the entire object whenever any byte changes. Neither one understands the shared internal structure of two checkpoints, so neither can dedup across runs, variants, or shards.

What Dits does

Dits treats an artifact as a stream of content-addressed chunks. Two checkpoints that share most of their weights share most of their chunks on disk, so storing the second one costs only the bytes that genuinely differ. Because every chunk and every commit is addressed by its content hash, you get real commits, real diffs between any two versions, and real lineage — the same primitives Git gives source code, applied to artifacts that are gigabytes instead of kilobytes.
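
To make the idea concrete, here is a minimal sketch of a content-addressed chunk store. It is an illustration under stated assumptions, not Dits's implementation: the ChunkStore class and its put, get, and stored_bytes methods are hypothetical names, chunks are cut at a fixed size for brevity, and hashlib.blake2b stands in for BLAKE3 because the Python standard library does not ship BLAKE3.

```python
import hashlib

CHUNK_SIZE = 1024  # hypothetical fixed chunk size, chosen only for this demo


class ChunkStore:
    """Toy content-addressed store: each chunk is keyed by the hash of its bytes."""

    def __init__(self):
        self.chunks = {}  # hex digest -> chunk bytes

    def put(self, data: bytes) -> list:
        """Store data chunk by chunk and return its manifest (ordered digests)."""
        manifest = []
        for offset in range(0, len(data), CHUNK_SIZE):
            chunk = data[offset:offset + CHUNK_SIZE]
            # blake2b is a stand-in for BLAKE3 (assumption, see lead-in).
            digest = hashlib.blake2b(chunk, digest_size=32).hexdigest()
            # A chunk that is already present costs nothing to store again.
            self.chunks.setdefault(digest, chunk)
            manifest.append(digest)
        return manifest

    def get(self, manifest: list) -> bytes:
        """Rebuild the original bytes from a manifest."""
        return b"".join(self.chunks[d] for d in manifest)

    def stored_bytes(self) -> int:
        return sum(len(c) for c in self.chunks.values())


# A 16 KiB stand-in "checkpoint" whose 1 KiB chunks are all distinct.
base = b"".join(i.to_bytes(4, "big") for i in range(4096))

# A "fine-tuned" variant that differs from the base in a single byte.
tuned_buf = bytearray(base)
tuned_buf[5000] ^= 0xFF
tuned = bytes(tuned_buf)

store = ChunkStore()
m_base = store.put(base)
size_after_base = store.stored_bytes()
m_tuned = store.put(tuned)
size_after_tuned = store.stored_bytes()

# A chunk-level "diff" is simply the manifest positions whose digests differ.
changed = [i for i, (a, b) in enumerate(zip(m_base, m_tuned)) if a != b]

print("bytes stored after base: ", size_after_base)
print("bytes stored after tuned:", size_after_tuned)
print("changed chunk indexes:   ", changed)
```

In this sketch the second artifact adds only the one 1024-byte chunk that contains the flipped byte, and both versions can still be rebuilt exactly from their manifests.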

Working today: a content-addressed store, BLAKE3 verification, FastCDC chunking with exact deduplication, byte-exact reconstruction, and local history (commit, add, status, diff, log, checkout, and branch). See “How it works” for the full pipeline.
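
The sketch below shows why content-defined chunking matters for deduplication. It is a simplified gear-hash chunker, not FastCDC itself: FastCDC adds refinements such as normalized chunking that are omitted here. The function names, the size parameters, and the randomly generated gear table are assumptions made for the demo, and hashlib.blake2b again stands in for BLAKE3.

```python
import hashlib
import random

# Hypothetical parameters for the demo; these are not Dits defaults.
MIN_SIZE = 256            # never cut a chunk shorter than this
MAX_SIZE = 8192           # always cut once a chunk reaches this length
MASK = (1 << 10) - 1      # cut when the low 10 hash bits are zero (~1 in 1024)

_rng = random.Random(0)   # fixed seed so the demo is deterministic
GEAR = [_rng.getrandbits(64) for _ in range(256)]  # one random value per byte


def cdc_chunks(data: bytes) -> list:
    """Split data at boundaries chosen by a rolling gear hash of the content."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        # Shift-and-add rolling hash: the low 10 bits depend only on the
        # last 10 bytes, so boundaries follow content, not file offsets.
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFFFFFFFFFF
        length = i - start + 1
        if (length >= MIN_SIZE and (h & MASK) == 0) or length >= MAX_SIZE:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks


def fixed_chunks(data: bytes, size: int = 1024) -> list:
    """Baseline: cut at fixed offsets regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def digests(chunks: list) -> set:
    # blake2b is a stand-in for BLAKE3 (assumption, see lead-in).
    return {hashlib.blake2b(c, digest_size=32).digest() for c in chunks}


# 64 KiB of pseudo-random bytes, then the same bytes with 7 bytes prepended.
original = bytes(_rng.getrandbits(8) for _ in range(64 * 1024))
shifted = b"HEADER!" + original

cdc_old, cdc_new = cdc_chunks(original), cdc_chunks(shifted)
cdc_shared = len(digests(cdc_old) & digests(cdc_new))
fixed_shared = len(digests(fixed_chunks(original)) & digests(fixed_chunks(shifted)))

# Concatenating the chunks gives back the input byte for byte.
assert b"".join(cdc_new) == shifted

print(f"content-defined: {cdc_shared} of {len(cdc_old)} chunks shared")
print(f"fixed-size:      {fixed_shared} chunks shared")
```

Prepending a few bytes shifts every fixed-size boundary, so the fixed-size baseline shares nothing with the original. The content-defined boundaries realign shortly after the inserted bytes, so most chunks keep the same hash and deduplicate.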

How it relates to Xet and DVC

Hugging Face Xet shares the foundation Dits builds on: content-defined chunking (CDC) over a content-addressed store (CAS). DVC is content-addressed too, but it hashes and stores whole files rather than chunks. CDC over CAS is what makes chunk-level dedup possible at all, and Dits implements it today. Dits is headed toward the addressing layers above that foundation (tensor-aware chunking, similarity-based dedup, and derivation/recompute) so that two artifacts can share structure even when their raw bytes are merely similar rather than identical.

Note

Be precise about status. Local content-addressing, exact dedup, and history all work today. Networked sync (push/pull/fetch) and the L2/L3 similarity and derivation layers are on the roadmap, not shipped.

Next steps

Quick start — install and make your first commit.
Addressing — how content hashes make dedup and lineage work.
The AI overview and benchmarks.