Chunking & Deduplication
Dits splits files where the content says to, not on a fixed grid — so an edit changes a few chunks instead of all of them.
Before anything gets a content address, a file has to be split into pieces. How you split matters enormously. Fixed-size blocks seem obvious, but they have a fatal flaw for versioned data: insert a single byte near the front and every block boundary after it shifts, so every downstream block changes and nothing dedups.
Content-defined chunking with FastCDC
Dits uses FastCDC, a content-defined chunker. Instead of cutting every N bytes, it slides a rolling hash across the data and places a boundary wherever the content hits a chosen pattern. Boundaries are anchored to the bytes around them, not to absolute offsets.
The payoff is boundary-shift resilience. Insert or append data and only the chunks near the change are affected — the boundaries before and after re-sync to the same content-defined cut points, so the surrounding chunks keep their old addresses and dedup against what you already stored.
# Fixed-size blocks: insert near the start shifts every later boundary
original: [AAAA][BBBB][CCCC][DDDD]
+1 byte: [xAAA][ABBB][BCCC][CDDD] # all 4 blocks now differ
# FastCDC: boundaries follow content, so they re-sync after the edit
original: [AAAA]|[BBBB]|[CCCC]|[DDDD]
+1 byte: [xAAAA]|[BBBB]|[CCCC]|[DDDD] # only the first chunk differsWhere exact dedup wins for AI today
Combined with content addressing, FastCDC chunks that did not change are stored exactly once. This is real, shipping behavior (layer L1 of Three-Layer Addressing), and it pays off whenever artifacts share large unchanged regions:
- Base model vs. fine-tune — a fine-tune that touches some tensors and leaves the rest byte-identical shares every untouched chunk with the base.
- Model vs. quantized or LoRA variant — where regions are preserved verbatim, those chunks dedup instead of duplicating.
- Shared dataset shards — datasets that reuse the same shards across versions store each shared shard once.
- Appended data — logs, growing corpora, and checkpointed datasets where new data is added at the end keep all the earlier chunks.