Similarity Dedup
Storing near-duplicate data once, by addressing chunks with a perceptual or semantic fingerprint instead of an exact hash.
Where exact hashing leaves value on the table
Exact content addressing (see Addressing) dedupes only byte-identical chunks. Flip a single bit and the BLAKE3 hash is completely different, so the chunk is stored again in full. For most code and text that is fine — content-defined chunking already isolates the edits.
AI datasets are different. They are saturated with near-duplicates: augmentations of the same image, web scrapes that re-host the same article, lightly edited prompts, resized or re-encoded media. These are almost the same, never identical, so exact-match hashing stores every variant at full size.
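The avalanche behavior described above can be checked with a short sketch. BLAKE3 is not in the Python standard library, so this example substitutes `hashlib.blake2b` as a stand-in (an assumption made only for illustration). The helper names `digest` and `differing_bits` are hypothetical.

```python
import hashlib

def digest(data: bytes) -> bytes:
    # Stand-in for BLAKE3 (assumption): BLAKE2b with a 32-byte digest.
    return hashlib.blake2b(data, digest_size=32).digest()

def differing_bits(a: bytes, b: bytes) -> int:
    # Count the bit positions where two equal-length byte strings differ.
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

original = b"the quick brown fox jumps over the lazy dog " * 100

# Build a near-duplicate that differs from the original by exactly one bit.
mutable = bytearray(original)
mutable[0] ^= 0x01
flipped = bytes(mutable)

h1, h2 = digest(original), digest(flipped)
changed = differing_bits(h1, h2)

print("input bits changed: ", differing_bits(original, flipped))
print("digest bits changed:", changed, "of", len(h1) * 8)
```

The two inputs are nearly identical, yet their digests share no useful relationship, so an exact-match store has no way to tell that the second chunk is a variant of the first.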
Address by fingerprint, not by exact hash
L2 addresses a chunk by a similarity fingerprint — a compact descriptor where near-duplicate inputs produce nearby descriptors:
- MinHash / SimHash for byte-ish and textual data, capturing token-set or shingle overlap.
- Perceptual hashes for media, robust to resize, re-encode, and minor edits.
- Embeddings for dataset samples, placing semantically similar items close together in vector space.
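As an illustration of the first option, here is a minimal MinHash sketch over character shingles, using only the standard library. It is not Dits's fingerprinting code: the function names, the shingle size `k`, and the choice of salted `hashlib.blake2b` as the hash family are all assumptions made for this example.

```python
import hashlib

def shingles(text: str, k: int = 5) -> set:
    """Return the set of overlapping k-character shingles of `text`."""
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}

def minhash_signature(text: str, num_hashes: int = 128, k: int = 5) -> list:
    """For each salted hash function, keep the minimum digest over all shingles."""
    items = [s.encode("utf-8") for s in shingles(text, k)]
    signature = []
    for seed in range(num_hashes):
        # A different salt acts as a different hash function.
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            hashlib.blake2b(item, digest_size=8, salt=salt).digest()
            for item in items
        ))
    return signature

def estimated_similarity(sig_a: list, sig_b: list) -> float:
    """Fraction of matching positions, an estimate of shingle-set Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

original = "Content-defined chunking splits a file at boundaries chosen by a rolling hash of the data."
edited = "Content-defined chunking splits a file at boundaries chosen by a rolling hash of the bytes."
unrelated = "Perceptual hashes summarize an image so that resized copies land on nearby codes."

sig_original = minhash_signature(original)
sim_near = estimated_similarity(sig_original, minhash_signature(edited))
sim_far = estimated_similarity(sig_original, minhash_signature(unrelated))

print(f"lightly edited: {sim_near:.2f}")
print(f"unrelated:      {sim_far:.2f}")
```

The lightly edited sentence should score far higher than the unrelated one, which is the property a similarity fingerprint needs: small edits move the descriptor only a little.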
Anchor plus delta
On write, the fingerprint is looked up in an approximate nearest neighbor (ANN) index to find the closest chunk already stored, called the anchor. If that neighbor falls within a similarity threshold, Dits stores only a small delta against the anchor instead of the whole chunk; otherwise the chunk is stored in full and becomes a new anchor. At read time, the original chunk is reconstructed by applying the delta to the anchor.
incoming chunk
  → fingerprint (minhash / phash / embedding)
  → ANN index lookup
      ├─ no neighbor within threshold → store chunk in full (becomes a new anchor)
      └─ neighbor found (anchor A) → store delta(A → chunk) ← small
read: chunk = apply(delta, anchor A)
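The write and read paths above can be sketched in a few dozen lines. This is a toy model under stated assumptions, not Dits's implementation: `SimilarityStore`, `fingerprint`, `make_delta`, and `apply_delta` are hypothetical names, a linear scan stands in for the ANN index, a byte-shingle set stands in for the fingerprint, and `difflib.SequenceMatcher` stands in for a real binary delta encoder.

```python
from difflib import SequenceMatcher

def fingerprint(chunk: bytes, k: int = 4) -> frozenset:
    # Hypothetical stand-in fingerprint: the set of k-byte shingles.
    return frozenset(chunk[i:i + k] for i in range(max(1, len(chunk) - k + 1)))

def similarity(a: frozenset, b: frozenset) -> float:
    # Jaccard similarity between two shingle sets.
    union = len(a | b)
    return len(a & b) / union if union else 1.0

def make_delta(anchor: bytes, chunk: bytes) -> list:
    # Copy ranges from the anchor where possible, carry literal bytes otherwise.
    ops = []
    matcher = SequenceMatcher(None, anchor, chunk, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))
        elif j2 > j1:  # "replace" or "insert": new bytes must be stored
            ops.append(("insert", chunk[j1:j2]))
        # "delete" needs no op: those anchor bytes are simply not copied
    return ops

def apply_delta(ops: list, anchor: bytes) -> bytes:
    out = bytearray()
    for op in ops:
        out += anchor[op[1]:op[2]] if op[0] == "copy" else op[1]
    return bytes(out)

class SimilarityStore:
    def __init__(self, threshold: float = 0.6):
        self.threshold = threshold  # assumed cutoff for "close enough"
        self.anchors = {}           # chunk id -> (fingerprint, full bytes)
        self.deltas = {}            # chunk id -> (anchor id, delta ops)
        self._next_id = 0

    def write(self, chunk: bytes) -> int:
        fp = fingerprint(chunk)
        best_id, best_sim = None, 0.0
        for anchor_id, (anchor_fp, _) in self.anchors.items():
            sim = similarity(fp, anchor_fp)  # linear scan in place of an ANN lookup
            if sim > best_sim:
                best_id, best_sim = anchor_id, sim
        chunk_id = self._next_id
        self._next_id += 1
        if best_id is not None and best_sim >= self.threshold:
            self.deltas[chunk_id] = (best_id, make_delta(self.anchors[best_id][1], chunk))
        else:
            self.anchors[chunk_id] = (fp, chunk)  # becomes a new anchor
        return chunk_id

    def read(self, chunk_id: int) -> bytes:
        if chunk_id in self.anchors:
            return self.anchors[chunk_id][1]
        anchor_id, ops = self.deltas[chunk_id]
        return apply_delta(ops, self.anchors[anchor_id][1])

store = SimilarityStore()
base = (b"A photo caption scraped from one site and re-hosted on another, "
        b"differing only in a trailing tracking tag.")
variant = base + b" ?utm=42"
other = b"0123456789 completely unrelated payload 9876543210"

base_id = store.write(base)
variant_id = store.write(variant)
other_id = store.write(other)
```

Here `variant` is stored as a delta against `base`, while `other` finds no neighbor within the threshold and becomes a second anchor. Reading any of the three ids returns the original bytes.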