Core Concepts

Similarity Dedup

Storing near-duplicate data once, by addressing chunks with a perceptual or semantic fingerprint instead of an exact hash.

Warning

This is a roadmap and design topic, not a shipped feature. Today only L1 ships: exact content addressing, FastCDC deduplication, BLAKE3 hashing, and local history. Similarity addressing (L2) is described here as a planned design.

Where exact hashing leaves value on the table

Exact content addressing (see Addressing) dedupes only byte-identical chunks. Flip a single bit and the BLAKE3 hash is completely different, so the chunk is stored again in full. For most code and text that is fine — content-defined chunking already isolates the edits.
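The single-bit sensitivity is easy to observe directly. The sketch below is illustrative only: it uses BLAKE2b from Python's standard library as a stand-in for BLAKE3 (an assumption made to avoid a third-party dependency), and the sample chunk and the `digest` helper are invented for the example.

```python
import hashlib


def digest(data: bytes) -> str:
    # Assumption: BLAKE2b stands in for BLAKE3, which is not in the stdlib.
    return hashlib.blake2b(data, digest_size=32).hexdigest()


# Hypothetical chunk contents, a few KiB of repeated text.
chunk = bytearray(b"the quick brown fox jumps over the lazy dog\n" * 100)
original = digest(bytes(chunk))

chunk[0] ^= 0x01  # flip the lowest bit of the first byte
flipped = digest(bytes(chunk))

# Count how many of the 256 digest bits differ between the two hashes.
differing_bits = bin(int(original, 16) ^ int(flipped, 16)).count("1")
print(original != flipped, differing_bits)
```

For a well-behaved cryptographic hash, about half of the digest bits change, so the two digests say nothing about how similar the inputs were.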

AI datasets are different. They are saturated with near-duplicates: augmentations of the same image, web scrapes that re-host the same article, lightly edited prompts, resized or re-encoded media. These variants are nearly identical but never byte-identical, so exact-match hashing stores every one of them at full size.

Address by fingerprint, not by exact hash

L2 addresses a chunk by a similarity fingerprint, a compact descriptor designed so that near-duplicate inputs map to nearby values:

MinHash / SimHash for byte-ish and textual data, capturing token-set or shingle overlap.
Perceptual hashes for media, robust to resize, re-encode, and minor edits.
Embeddings for samples, placing semantically similar items close in vector space.
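To make the first option concrete, here is a minimal MinHash sketch over word shingles. It is not Dits code: the function names, the shingle size, the number of hash functions, and the use of salted BLAKE2b as the hash family are all assumptions chosen for illustration.

```python
import hashlib


def shingles(text: str, k: int = 3) -> set[str]:
    """Return the set of k-word shingles of a whitespace-tokenized text."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def minhash(items: set[str], num_perm: int = 128) -> list[int]:
    """For each of num_perm salted hash functions, keep the minimum hash value."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt gives a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode(), digest_size=8, salt=salt).digest(), "little"
            )
            for s in items
        ))
    return signature


def estimated_jaccard(a: list[int], b: list[int]) -> float:
    """The fraction of matching signature slots estimates the Jaccard similarity."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


base = "the quick brown fox jumps over the lazy dog near the quiet river bank at dawn"
near = base.replace("lazy", "sleepy")  # lightly edited copy
other = "stock markets closed higher today after a surprise interest rate cut"

sig_base, sig_near, sig_other = (minhash(shingles(t)) for t in (base, near, other))
print(estimated_jaccard(sig_base, sig_near), estimated_jaccard(sig_base, sig_other))
```

The lightly edited copy shares most signature slots with the original, while the unrelated text shares essentially none. That gap is the property a similarity index relies on.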

Anchor plus delta

On write, the fingerprint is looked up in an approximate nearest-neighbor (ANN) index to find the closest chunk already stored, called the anchor. If a close-enough match exists, Dits stores only a small delta against that anchor instead of the whole chunk. At read time, the original is reconstructed by applying the delta to the anchor.

incoming chunk
  → fingerprint  (minhash / phash / embedding)
  → ANN index lookup
      ├─ no neighbor within threshold → store chunk in full (becomes a new anchor)
      └─ neighbor found (anchor A)    → store delta(A → chunk)  ← small
read: chunk = apply(delta, anchor A)
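The flow above can be sketched as a toy in-memory store. Everything here is hypothetical and simplified: `SimilarityStore` and its helpers are invented names, a brute-force scan stands in for the ANN index, a set of byte shingles stands in for the fingerprint, `difflib` opcodes stand in for a real delta format, and BLAKE2b stands in for BLAKE3.

```python
import hashlib
from difflib import SequenceMatcher


def exact_hash(data: bytes) -> str:
    # Assumption: BLAKE2b stands in for BLAKE3.
    return hashlib.blake2b(data, digest_size=32).hexdigest()


def fingerprint(data: bytes, k: int = 8) -> frozenset[bytes]:
    # Toy fingerprint: the set of k-byte shingles, compared by Jaccard similarity.
    return frozenset(data[i:i + k] for i in range(max(1, len(data) - k + 1)))


def similarity(a: frozenset, b: frozenset) -> float:
    return len(a & b) / len(a | b)


def make_delta(anchor: bytes, chunk: bytes) -> list[tuple]:
    """Encode chunk as copies from the anchor plus literal bytes."""
    ops = []
    matcher = SequenceMatcher(None, anchor, chunk, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))
        elif j2 > j1:  # "replace" or "insert": carry the new bytes literally
            ops.append(("data", chunk[j1:j2]))
        # "delete" emits nothing: those anchor bytes are simply skipped
    return ops


def apply_delta(ops: list[tuple], anchor: bytes) -> bytes:
    return b"".join(anchor[op[1]:op[2]] if op[0] == "copy" else op[1] for op in ops)


class SimilarityStore:
    def __init__(self, threshold: float = 0.5):
        self.threshold = threshold
        self.anchors = {}  # exact hash -> (fingerprint, full bytes)
        self.deltas = {}   # exact hash -> (anchor hash, delta ops)

    def write(self, chunk: bytes) -> str:
        key = exact_hash(chunk)
        if key in self.anchors or key in self.deltas:
            return key  # exact duplicate, nothing to store
        fp = fingerprint(chunk)
        # Brute-force nearest neighbor; a real index would be approximate.
        best = max(self.anchors.items(),
                   key=lambda kv: similarity(fp, kv[1][0]), default=None)
        if best is not None and similarity(fp, best[1][0]) >= self.threshold:
            anchor_key, (_, anchor_bytes) = best
            self.deltas[key] = (anchor_key, make_delta(anchor_bytes, chunk))
        else:
            self.anchors[key] = (fp, chunk)  # becomes a new anchor
        return key

    def read(self, key: str) -> bytes:
        if key in self.anchors:
            return self.anchors[key][1]
        anchor_key, ops = self.deltas[key]
        chunk = apply_delta(ops, self.anchors[anchor_key][1])
        if exact_hash(chunk) != key:  # reconstruction is checked against the exact hash
            raise ValueError("reconstructed chunk does not match its exact hash")
        return chunk


store = SimilarityStore()
base = b"".join(b"record %04d: payload\n" % i for i in range(40))
variant = base.replace(b"record 0010", b"RECORD 0010")  # near-duplicate
unrelated = bytes(range(256)) * 2

k_base, k_variant, k_other = (store.write(c) for c in (base, variant, unrelated))
literal_bytes = sum(len(op[1]) for op in store.deltas[k_variant][1] if op[0] == "data")
print(len(variant), literal_bytes)
```

In this sketch the variant is stored as a couple of copy ranges plus a handful of literal bytes, while the unrelated chunk finds no neighbor within the threshold and is stored in full as a new anchor.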

Note

The ANN index is approximate on purpose: an exact nearest-neighbor search over billions of chunks would be prohibitively slow, and the anchor only needs to be good enough for the delta to be small. A missed match costs one extra full chunk, and a poor match costs a larger delta. Neither is a correctness bug, since reconstruction is always verified against an exact hash.

Tip

Related directions: tensor-domain diffs in Tensor-Aware Chunking and recompute-instead-of-store in Derivation & Recompute. See the Roadmap for sequencing and How it works for an overview.