How Dits for AI works
The same engine as Dits for media, pointed at weights and datasets. Here's the mental model in three addressing layers — including where today's approach genuinely struggles.
Most tools address data only by the hash of its bytes. That is the ceiling: two artifacts that are 99% the same but share no byte-identical chunks dedupe to nothing. Dits is built to address data three ways, exact (L1), similar (L2), and derived (L3), so dedup keeps working exactly where the biggest AI data lives.
Content-addressed bytes
An artifact is split into content-defined chunks with FastCDC, each addressed by its BLAKE3 hash. Two chunks with identical bytes get the same address and are stored once. Reads are always integrity-verified against the hash. This is the foundation every content-addressed store shares — and it ships today on the Dits engine.
Wins wherever byte-identical regions survive: base model vs. fine-tune, model vs. quantized/LoRA variant, shared dataset shards, appended data.
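To make the exact-match layer concrete, here is a minimal sketch of content-defined chunking feeding a content-addressed store. It is an illustration under stated assumptions, not the Dits engine: a simple gear rolling hash stands in for FastCDC, BLAKE2b from the Python standard library stands in for BLAKE3, and `split_chunks`, `ChunkStore`, and the size parameters are hypothetical names and values chosen for the example.

```python
import hashlib
import random

# Gear table: 256 fixed pseudo-random 32-bit values (seeded so boundaries are reproducible).
_rng = random.Random(0)
GEAR = [_rng.getrandbits(32) for _ in range(256)]


def split_chunks(data, mask=0x3FF, min_size=256, max_size=8192):
    """Cut where a rolling hash of recent bytes matches a bit pattern,
    so boundaries follow content rather than fixed offsets.
    (Simplified gear hash, standing in for FastCDC.)"""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        size = i - start + 1
        if (size >= min_size and (h & mask) == 0) or size >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks


def address(chunk):
    # Assumption: BLAKE2b stands in for BLAKE3, which is not in the standard library.
    return hashlib.blake2b(chunk, digest_size=32).hexdigest()


class ChunkStore:
    def __init__(self):
        self.blobs = {}  # address -> chunk bytes; each distinct chunk is stored once

    def put(self, data):
        """Store an artifact and return its manifest (ordered chunk addresses)."""
        manifest = []
        for chunk in split_chunks(data):
            addr = address(chunk)
            self.blobs.setdefault(addr, chunk)  # identical bytes reuse the same entry
            manifest.append(addr)
        return manifest

    def get(self, manifest):
        """Reassemble an artifact, verifying every chunk against its address."""
        parts = []
        for addr in manifest:
            chunk = self.blobs[addr]
            if address(chunk) != addr:
                raise ValueError(f"chunk {addr} failed integrity check")
            parts.append(chunk)
        return b"".join(parts)

    def stored_bytes(self):
        return sum(len(chunk) for chunk in self.blobs.values())
```

Storing an artifact and then the same artifact with data appended adds roughly the appended bytes plus one boundary chunk, because every chunk cut before the old end of the data gets the same address both times.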
Similarity-addressed
Exact hashing only catches byte-identical chunks. But datasets are full of near-duplicates — augmented images, re-encodes, lightly-edited samples — and so are media libraries. L2 addresses a chunk by a perceptual or semantic fingerprint, finds the closest existing chunk, and stores a small delta against it. Near-duplicate becomes near-zero storage.
Wins on: deduping training data, augmentation pipelines, and re-encoded media that exact-match misses entirely.
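The lookup-then-delta flow can be sketched in a few lines. Everything here is an assumption made for illustration: a SimHash over byte shingles stands in for a real perceptual or semantic fingerprint, a zlib preset dictionary stands in for a proper delta encoder, and `SimilarityStore`, `max_distance`, and the linear nearest-neighbor scan are hypothetical, not how Dits implements L2.

```python
import hashlib
import zlib


def fingerprint(chunk, shingle=4):
    """64-bit SimHash over byte shingles: similar inputs give nearby bit patterns.
    (Stand-in for a perceptual or semantic fingerprint.)"""
    counts = [0] * 64
    for i in range(len(chunk) - shingle + 1):
        digest = hashlib.blake2b(chunk[i:i + shingle], digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        for bit in range(64):
            counts[bit] += 1 if (h >> bit) & 1 else -1
    return sum(1 << bit for bit in range(64) if counts[bit] > 0)


def hamming(a, b):
    return bin(a ^ b).count("1")


def make_delta(base, target):
    # Compress the target using the base as a preset dictionary,
    # so regions shared with the base cost only a few bytes.
    compressor = zlib.compressobj(zdict=base)
    return compressor.compress(target) + compressor.flush()


def apply_delta(base, delta):
    decompressor = zlib.decompressobj(zdict=base)
    return decompressor.decompress(delta) + decompressor.flush()


class SimilarityStore:
    def __init__(self, max_distance=12):
        self.full = {}    # key -> (fingerprint, chunk bytes)
        self.deltas = {}  # key -> (base key, delta bytes)
        self.max_distance = max_distance

    def put(self, key, chunk):
        fp = fingerprint(chunk)
        # Linear scan for the closest stored chunk; a real index would avoid this.
        nearest = min(self.full, key=lambda k: hamming(fp, self.full[k][0]), default=None)
        if nearest is not None and hamming(fp, self.full[nearest][0]) <= self.max_distance:
            delta = make_delta(self.full[nearest][1], chunk)
            if len(delta) < len(chunk):  # keep the delta only if it actually saves space
                self.deltas[key] = (nearest, delta)
                return "delta"
        self.full[key] = (fp, chunk)
        return "full"

    def get(self, key):
        if key in self.full:
            return self.full[key][1]
        base_key, delta = self.deltas[key]
        return apply_delta(self.full[base_key][1], delta)
```

A chunk that differs from a stored one in a handful of bytes should land within `max_distance` and be kept as a small delta, while unrelated data falls back to full storage.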
Derivation-addressed
Many heavy artifacts are reproducible. A quantized model is derivable from its source. A LoRA is a recipe applied to a base. A checkpoint is reproducible from (data refs + config + seed), provided training is deterministic. L3 stores the recipe, a few hundred bytes, and recomputes the artifact on demand instead of storing the bytes, spending cheap compute to save expensive storage.
Wins on: quantized variants, derived exports, and reproducible checkpoints where storing terabytes is wasteful.
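As a rough illustration of storing a recipe instead of bytes, the sketch below records an operation name, a source address, and parameters, then recomputes the artifact on read and checks it against a recorded output hash. The recipe layout, the `quantize_int8` operation, and `DerivationStore` are hypothetical and invented for the example; BLAKE2b again stands in for BLAKE3.

```python
import hashlib
import json
import struct


def address(data):
    # Assumption: BLAKE2b stands in for BLAKE3.
    return hashlib.blake2b(data, digest_size=32).hexdigest()


def quantize_int8(source, scale):
    """Hypothetical deterministic op: little-endian float32 weights -> int8 bytes."""
    count = len(source) // 4
    floats = struct.unpack(f"<{count}f", source)
    # Clamp to [-127, 127], then encode as a two's-complement byte.
    return bytes(max(-127, min(127, round(x / scale))) & 0xFF for x in floats)


OPS = {"quantize_int8": quantize_int8}  # registry of reproducible operations


class DerivationStore:
    def __init__(self):
        self.blobs = {}    # address -> stored bytes
        self.recipes = {}  # address -> recipe (JSON bytes)

    def put_blob(self, data):
        addr = address(data)
        self.blobs[addr] = data
        return addr

    def put_derived(self, op, source_addr, params):
        """Run the op once to record the output hash, then keep only the recipe."""
        output = OPS[op](self.get(source_addr), **params)
        recipe = json.dumps(
            {"op": op, "source": source_addr, "params": params, "output": address(output)},
            sort_keys=True,
        ).encode()
        addr = address(recipe)
        self.recipes[addr] = recipe
        return addr

    def get(self, addr):
        if addr in self.blobs:
            return self.blobs[addr]
        recipe = json.loads(self.recipes[addr])
        output = OPS[recipe["op"]](self.get(recipe["source"]), **recipe["params"])
        if address(output) != recipe["output"]:
            raise ValueError("recomputed artifact does not match the recorded hash")
        return output
```

The design choice worth noting is the recorded output hash: it lets a read detect an operation that turned out not to be deterministic, rather than silently returning different bytes.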
Where byte-level chunking struggles
We won't oversell it. Model weights are float tensors, and a single gradient step nudges almost every weight a little. The bytes change everywhere, not in localized regions — which is the opposite of what content-defined chunking is best at. So naive byte-level dedup between two consecutive checkpoints performs poorly.
Where L1 wins today: base model ↔ fine-tune, model ↔ quantized/LoRA variant, shared dataset shards, and append-style changes. Closing the consecutive-checkpoint gap needs tensor-aware chunking (diffing tensors in their own domain, not as raw bytes). That is the next milestone on the engine roadmap, not a shipped feature.
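A small experiment shows the gap. The sketch below simulates one optimizer step as a tiny random nudge to each weight (an assumption, not real training), hashes both checkpoints in fixed-size chunks, and then compares compressing the new checkpoint directly against compressing an element-aligned XOR of the two. The XOR delta is one commonly used trick for nearby floats and is offered only as an illustration of diffing in the tensor's own domain; it is not a description of how Dits does or will do tensor-aware chunking, and BLAKE2b stands in for BLAKE3.

```python
import hashlib
import random
import struct
import zlib


def chunk_addresses(data, size=4096):
    """Hash fixed-size chunks (a simplification of content-defined chunking)."""
    return {
        hashlib.blake2b(data[i:i + size], digest_size=32).digest()
        for i in range(0, len(data), size)
    }


def pack(weights):
    # Serialize as little-endian float32, the way a raw tensor would sit on disk.
    return struct.pack(f"<{len(weights)}f", *weights)


def xor_delta(old, new):
    """Element-aligned XOR: bits shared by both checkpoints become zeros.
    Applying it again to `old` recovers `new`."""
    return bytes(a ^ b for a, b in zip(old, new))


rng = random.Random(0)
step_n = [rng.uniform(-1.0, 1.0) for _ in range(8192)]
# Simulated update: nearly every weight moves a little (hypothetical, not real gradients).
step_n1 = [w + rng.gauss(0.0, 1e-4) for w in step_n]

old, new = pack(step_n), pack(step_n1)

# Byte-level view: how many chunk addresses do the two checkpoints share?
shared = chunk_addresses(old) & chunk_addresses(new)

# Tensor-aligned view: compress the new checkpoint alone vs. its XOR against the old one.
raw_size = len(zlib.compress(new))
delta = xor_delta(old, new)
delta_size = len(zlib.compress(delta))
```

Here `shared` comes out empty, since every chunk contains changed floats, while the XOR delta compresses better than the raw checkpoint because the sign, exponent, and upper mantissa bits of most weights are unchanged. How much better depends on the size of the update, so treat any ratio from a toy like this as indicative at most.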
See the numbers
Storage and bandwidth projections for real checkpoint and dataset workloads — clearly labeled measured vs. modeled.