The honest version

How Dits for AI works

The same engine as Dits for media, pointed at weights and datasets. Here's the mental model in three addressing layers — including where today's approach genuinely struggles.

Most other tools address data only by the hash of its bytes. That sets a ceiling: two chunks that are 99% the same but not byte-identical share nothing. Dits is built to address data three ways (exact, similar, and derived) so that dedup keeps working where the biggest AI data lives.

L1 · Exact · Working today

Content-addressed bytes

An artifact is split into content-defined chunks with FastCDC, each addressed by its BLAKE3 hash. Two chunks with identical bytes get the same address and are stored once. Reads are always integrity-verified against the hash. This is the foundation every content-addressed store shares — and it ships today on the Dits engine.

Wins on: base model vs. fine-tune, model vs. quantized/LoRA variant, shared dataset shards, appended data.
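To make the exact layer concrete, here is a minimal sketch of content-defined chunking plus hash addressing. It is not the Dits engine: the gear rolling hash is a simplified stand-in for FastCDC, `hashlib.blake2b` stands in for BLAKE3 (which is not in the Python standard library), and the size parameters and the `ChunkStore` class are hypothetical names chosen for illustration.

```python
import hashlib
import random

# Assumed parameters for this sketch only; not Dits' real settings.
MIN_SIZE = 256            # never cut a chunk shorter than this
MAX_SIZE = 4096           # always cut once a chunk reaches this size
CUT_MASK = 0xFFC00000     # cut when the top 10 hash bits are zero (~1 in 1024)

# Gear table: one fixed pseudo-random 32-bit value per possible byte.
_table_rng = random.Random(0)
GEAR = [_table_rng.getrandbits(32) for _ in range(256)]


def chunk(data: bytes) -> list[bytes]:
    """Split data at boundaries chosen by content, not by offset."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        # Rolling hash: older bytes shift out of the 32-bit state.
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        size = i - start + 1
        if size < MIN_SIZE:
            continue
        if (h & CUT_MASK) == 0 or size >= MAX_SIZE:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks


def address(chunk_bytes: bytes) -> str:
    """Content address of a chunk (BLAKE2b here as a stand-in for BLAKE3)."""
    return hashlib.blake2b(chunk_bytes, digest_size=32).hexdigest()


class ChunkStore:
    """Hypothetical in-memory store: identical chunks are kept once."""

    def __init__(self) -> None:
        self.blobs: dict[str, bytes] = {}

    def put(self, data: bytes) -> list[str]:
        """Store an artifact and return its manifest of chunk addresses."""
        manifest = []
        for piece in chunk(data):
            addr = address(piece)
            self.blobs.setdefault(addr, piece)  # no-op if already stored
            manifest.append(addr)
        return manifest

    def get(self, manifest: list[str]) -> bytes:
        """Reassemble an artifact, verifying every chunk against its address."""
        pieces = []
        for addr in manifest:
            piece = self.blobs[addr]
            if address(piece) != addr:
                raise ValueError(f"integrity check failed for chunk {addr}")
            pieces.append(piece)
        return b"".join(pieces)
```

Storing a blob and then the same blob with data appended should reuse every chunk except possibly the last one, because the boundaries depend only on the bytes seen so far. That is the append-style case listed above.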

L2 · Similar · Roadmap

Similarity-addressed

Exact hashing only catches byte-identical chunks. But datasets are full of near-duplicates — augmented images, re-encodes, lightly-edited samples — and so are media libraries. L2 addresses a chunk by a perceptual or semantic fingerprint, finds the closest existing chunk, and stores a small delta against it. Near-duplicate becomes near-zero storage.

Wins on: deduping training data, augmentation pipelines, and re-encoded media that exact-match misses entirely.
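Since L2 is on the roadmap rather than shipped, the following is only a sketch of the general idea (fingerprint, nearest match, small delta), not Dits' design. It assumes a byte-level MinHash sketch as a stand-in for a perceptual or semantic fingerprint, a brute-force nearest-neighbor scan, and zlib with a preset dictionary as the delta encoder. `SimilarityStore`, the caller-supplied keys, and all constants are hypothetical.

```python
import hashlib
import zlib

SHINGLE = 8        # bytes per shingle (assumed)
SKETCH_SIZE = 64   # hashes kept per fingerprint (assumed)
THRESHOLD = 0.5    # minimum estimated similarity to store a delta (assumed)


def fingerprint(chunk: bytes) -> frozenset[int]:
    """Bottom-k MinHash sketch: the k smallest hashes of all byte shingles."""
    hashes = {
        int.from_bytes(
            hashlib.blake2b(chunk[i:i + SHINGLE], digest_size=8).digest(), "big"
        )
        for i in range(max(1, len(chunk) - SHINGLE + 1))
    }
    return frozenset(sorted(hashes)[:SKETCH_SIZE])


def similarity(a: frozenset[int], b: frozenset[int]) -> float:
    """Crude similarity estimate: Jaccard overlap of the two sketches."""
    return len(a & b) / max(1, len(a | b))


def encode_delta(base: bytes, target: bytes) -> bytes:
    """Compress target using base as a preset dictionary."""
    comp = zlib.compressobj(level=9, zdict=base)
    return comp.compress(target) + comp.flush()


def decode_delta(base: bytes, delta: bytes) -> bytes:
    """Rebuild the target from its base chunk and the stored delta."""
    decomp = zlib.decompressobj(zdict=base)
    return decomp.decompress(delta) + decomp.flush()


class SimilarityStore:
    """Hypothetical store: near-duplicates are kept as deltas against a base."""

    def __init__(self) -> None:
        self.full: dict[str, tuple[frozenset[int], bytes]] = {}
        self.deltas: dict[str, tuple[str, bytes]] = {}

    def put(self, key: str, chunk: bytes) -> str:
        fp = fingerprint(chunk)
        best_key, best_sim = None, 0.0
        for other_key, (other_fp, _) in self.full.items():  # brute-force scan
            sim = similarity(fp, other_fp)
            if sim > best_sim:
                best_key, best_sim = other_key, sim
        if best_key is not None and best_sim >= THRESHOLD:
            base = self.full[best_key][1]
            self.deltas[key] = (best_key, encode_delta(base, chunk))
            return "delta"
        self.full[key] = (fp, chunk)
        return "full"

    def get(self, key: str) -> bytes:
        if key in self.full:
            return self.full[key][1]
        base_key, delta = self.deltas[key]
        return decode_delta(self.full[base_key][1], delta)
```

The preset dictionary only helps while the base chunk fits in zlib's 32 KiB window, and a linear scan will not scale to a real corpus. A production design would need a proper delta format and an approximate nearest-neighbor index.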

L3 · Derived · Roadmap

Derivation-addressed

Many heavy artifacts are reproducible. A quantized model is derivable from its source. A LoRA is a recipe applied to a base. A checkpoint is reproducible from (data refs + config + seed). L3 stores the recipe — a few hundred bytes — and recomputes the artifact on demand instead of storing the bytes, trading cheap compute for expensive storage.

Wins on: quantized variants, derived exports, and reproducible checkpoints where storing terabytes is wasteful.
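As a rough illustration of storing a recipe instead of bytes, the sketch below addresses a small JSON recipe by its hash and recomputes the artifact when asked. L3 is a roadmap item, so everything here is assumed: the recipe fields (`op`, `inputs`, `params`), the `OPS` registry, and the toy `quantize_int8` operation are hypothetical, and BLAKE2b again stands in for BLAKE3.

```python
import hashlib
import json
import struct

# Hypothetical registry: op name -> function(inputs, params) -> bytes.
OPS = {}


def op(name: str):
    """Decorator that registers a deterministic derivation under a name."""
    def register(fn):
        OPS[name] = fn
        return fn
    return register


@op("quantize_int8")
def quantize_int8(inputs: list[bytes], params: dict) -> bytes:
    """Toy symmetric int8 quantization of a little-endian float32 buffer."""
    raw = inputs[0]
    values = struct.unpack(f"<{len(raw) // 4}f", raw)
    scale = params["scale"]
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return struct.pack(f"<{len(quantized)}b", *quantized)


def recipe_address(recipe: dict) -> str:
    """Address a recipe by the hash of its canonical JSON encoding."""
    canonical = json.dumps(recipe, sort_keys=True, separators=(",", ":"))
    return hashlib.blake2b(canonical.encode(), digest_size=32).hexdigest()


def materialize(recipe: dict, blobs: dict[str, bytes]) -> bytes:
    """Recompute the artifact from its recipe and referenced inputs."""
    inputs = [blobs[ref] for ref in recipe["inputs"]]
    return OPS[recipe["op"]](inputs, recipe["params"])


# Usage: only the recipe is stored; the int8 bytes are rebuilt on demand.
blobs = {"base": struct.pack("<4f", 0.5, -1.0, 0.25, 2.0)}
recipe = {"op": "quantize_int8", "inputs": ["base"], "params": {"scale": 0.25}}
artifact = materialize(recipe, blobs)
```

This only works when the operation is deterministic: the same inputs and parameters must always yield the same bytes, otherwise the recipe's address no longer identifies a single artifact.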

Where byte-level chunking struggles

We won't oversell it. Model weights are float tensors, and a single gradient step nudges almost every weight a little. The bytes change everywhere, not in localized regions — which is the opposite of what content-defined chunking is best at. So naive byte-level dedup between two consecutive checkpoints performs poorly.
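The effect is easy to reproduce with synthetic numbers. The sketch below is not a Dits benchmark: it uses random float32 "weights", one made-up gradient step, and fixed-size chunks hashed with BLAKE2b (chosen only to keep the example short), then counts how many chunks two consecutive checkpoints share. An append-only change is included for contrast.

```python
import hashlib
import random
import struct

CHUNK = 256  # bytes per fixed-size chunk (assumed, for illustration only)


def pack(values: list[float]) -> bytes:
    """Serialize values as little-endian float32, like a raw weight tensor."""
    return struct.pack(f"<{len(values)}f", *values)


def chunk_addresses(blob: bytes) -> set[str]:
    """Hash every fixed-size chunk of the blob."""
    return {
        hashlib.blake2b(blob[i:i + CHUNK], digest_size=32).hexdigest()
        for i in range(0, len(blob), CHUNK)
    }


rng = random.Random(0)
weights = [rng.uniform(-1, 1) for _ in range(4096)]
grads = [rng.uniform(-1, 1) for _ in range(4096)]

# One synthetic gradient step: every weight moves by a tiny amount.
learning_rate = 1e-3
stepped = [w - learning_rate * g for w, g in zip(weights, grads)]

before, after = pack(weights), pack(stepped)

# Fraction of float32 values whose bytes changed after the step.
changed = sum(before[i:i + 4] != after[i:i + 4] for i in range(0, len(before), 4))
changed_fraction = changed / len(weights)

# Chunks shared between the two consecutive checkpoints.
shared_after_step = len(chunk_addresses(before) & chunk_addresses(after))

# Contrast: appending new data leaves every existing chunk untouched.
appended = before + pack(grads[:256])
shared_after_append = len(chunk_addresses(before) & chunk_addresses(appended))

print(f"values changed by one step: {changed_fraction:.1%}")
print(f"chunks shared after step:   {shared_after_step}")
print(f"chunks shared after append: {shared_after_append}")
```

With these synthetic inputs, essentially every value changes and no chunk survives the step, while the appended file keeps all of its original chunks. Real checkpoints may behave differently, for example when some layers are frozen.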

Where L1 wins today: base model ↔ fine-tune, model ↔ quantized/LoRA variant, shared dataset shards, and append-style changes. Closing the consecutive-checkpoint gap needs tensor-aware chunking (diffing tensors in their own domain, not as raw bytes) — which is the next milestone on the engine roadmap, not a shipped claim.

See the numbers

Storage and bandwidth projections for real checkpoint and dataset workloads — clearly labeled measured vs. modeled.