Core Concepts

Chunking & Deduplication

Dits splits files where the content says to, not on a fixed grid — so an edit changes a few chunks instead of all of them.

Before anything gets a content address, a file has to be split into pieces. How you split matters enormously. Fixed-size blocks seem obvious, but they have a fatal flaw for versioned data: insert a single byte near the front and every block boundary after it shifts, so every downstream block changes and nothing dedups.

Content-defined chunking with FastCDC

Dits uses FastCDC, a content-defined chunker. Instead of cutting every N bytes, it slides a rolling hash across the data and places a boundary wherever the content hits a chosen pattern. Boundaries are anchored to the bytes around them, not to absolute offsets.

The payoff is boundary-shift resilience. Insert or append data and only the chunks near the change are affected — the boundaries before and after re-sync to the same content-defined cut points, so the surrounding chunks keep their old addresses and dedup against what you already stored.

# Fixed-size blocks: insert near the start shifts every later boundary
original:  [AAAA][BBBB][CCCC][DDDD]
+1 byte:   [xAAA][ABBB][BCCC][CDDD]   # all 4 blocks now differ

# FastCDC: boundaries follow content, so they re-sync after the edit
original:  [AAAA]|[BBBB]|[CCCC]|[DDDD]
+1 byte:   [xAAAA]|[BBBB]|[CCCC]|[DDDD]   # only the first chunk differs

Where exact dedup wins for AI today

Combined with content addressing, FastCDC chunks that did not change are stored exactly once. This is real, shipping behavior (layer L1 of Three-Layer Addressing), and it pays off whenever artifacts share large unchanged regions:

Base model vs. fine-tune — a fine-tune that touches some tensors and leaves the rest byte-identical shares every untouched chunk with the base.
Model vs. quantized or LoRA variant — where regions are preserved verbatim, those chunks dedup instead of duplicating.
Shared dataset shards — datasets that reuse the same shards across versions store each shared shard once.
Appended data — logs, growing corpora, and checkpointed datasets where new data is added at the end keep all the earlier chunks.

Warning

The weak spot: consecutive training checkpoints. Between two adjacent checkpoints, the optimizer nudges floats throughout the weight tensors. Those tiny, diffuse updates change bytes nearly everywhere, so byte-level FastCDC finds almost no shared chunks and dedup collapses to near zero. The roadmap fix is Tensor-Aware Chunking, which aligns boundaries to tensor structure instead of byte patterns.

Note

The chunking algorithm here is the same one in the engine. For the full reference — parameters, boundary masks, and the rolling-hash details — see the engine reference.