
Three-Layer Addressing

Most tools address your data by the hash of its bytes. Dits adds two more ways to ask “have I seen this before?”

Almost every storage and version-control system addresses data the same way: hash the bytes, and use that hash as the address. It is simple, it is fast, and it is exactly the problem. Byte-identical is the ceiling. Two artifacts that are 99% the same but not byte-for-byte identical share nothing — they are stored twice, transferred twice, and reasoned about as if they were unrelated.

For AI data this ceiling is brutal. A fine-tune, a quantized variant, a LoRA adapter, the next training checkpoint — each is a small perturbation of something you already have, yet exact-match addressing treats every one as brand new. Dits attacks this with three layers of addressing, each one catching what the layer below it misses.

L1 — Exact (BLAKE3 of the bytes)

The foundation: every chunk is named by the BLAKE3 hash of its content. Same bytes produce the same address, so identical content is stored exactly once, and every read is verified against its hash — corruption is detected, never returned silently.
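To make the idea concrete, here is a minimal sketch of an exact, content-addressed store. It is an illustration, not Dits' implementation: `ExactStore` and `address` are hypothetical names, and because BLAKE3 is not in the Python standard library the sketch substitutes BLAKE2b from `hashlib` as a stand-in hash.

```python
import hashlib


def address(chunk: bytes) -> str:
    # Stand-in hash: BLAKE2b from the standard library, NOT BLAKE3.
    return hashlib.blake2b(chunk, digest_size=32).hexdigest()


class ExactStore:
    """Hypothetical in-memory content-addressed store (illustration only)."""

    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}

    def put(self, chunk: bytes) -> str:
        key = address(chunk)
        # Identical bytes map to the same key, so they are kept only once.
        self._objects.setdefault(key, chunk)
        return key

    def get(self, key: str) -> bytes:
        chunk = self._objects[key]
        # Verify on every read: corrupted content is rejected, never returned.
        if address(chunk) != key:
            raise ValueError(f"corrupt object {key}")
        return chunk

    def __len__(self) -> int:
        return len(self._objects)
```

Calling `put` twice with the same bytes returns the same key and leaves one object in the store. If the stored bytes are altered afterwards, `get` raises instead of returning them.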

Status: working today. Content-addressed object store, FastCDC chunking, exact deduplication, and byte-exact reconstruction all ship in L1. This is the bedrock the other layers build on. See Content Addressing and Chunking & Deduplication.

Where it wins: shared dataset shards, a base model versus a fine-tune that left most tensors untouched, and any data that is appended to over time. Where it stops: anything that is nearly the same but not identical.
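The append case can be sketched with a toy content-defined chunker. This is a simplified gear-hash chunker written for illustration, not FastCDC itself and not Dits' code; the `chunk` function, the table seed, and the size limits are all assumptions chosen to keep the example small.

```python
import random

# Fixed-seed table of 256 pseudo-random 32-bit values (a "gear" table).
# The seed is arbitrary; it only needs to be the same on every run.
_table_rng = random.Random(0)
GEAR = [_table_rng.getrandbits(32) for _ in range(256)]

MASK = 0x3F << 26   # test the top 6 bits: a cut roughly every 64 bytes
MIN_SIZE = 16       # toy limits; real chunkers use kilobyte-scale sizes
MAX_SIZE = 256


def chunk(data: bytes) -> list[bytes]:
    """Split data at positions chosen by the content, not by fixed offsets."""
    chunks = []
    start = 0
    h = 0
    for i, byte in enumerate(data):
        # Rolling gear hash: shifting ages out older bytes.
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        size = i - start + 1
        at_boundary = size >= MIN_SIZE and (h & MASK) == 0
        if at_boundary or size >= MAX_SIZE:
            chunks.append(data[start:i + 1])
            start = i + 1
            h = 0  # restart the hash so each chunk depends only on its own bytes
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks
```

Because the hash restarts at every cut, chunking a file and chunking the same file with more data appended produce the same chunks up to the old file's final partial chunk. Under exact addressing, those shared chunks already exist in the store, so only the new tail needs to be written.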

L2 — Similar (fingerprint, then store the delta)

L2 computes a perceptual or semantic fingerprint for a chunk, finds the nearest chunk already stored, and keeps only the delta against it. This catches near-duplicates that L1 cannot see: two quantizations of the same weights, an adapter that nudges a handful of layers, a dataset with a few thousand rows changed.

Status: roadmap. L2 is not shipping today.
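The fingerprint, nearest-match, and delta steps can be sketched as follows. Everything here is an assumption made for illustration: the fingerprint is a plain set of byte shingles compared by Jaccard similarity (not a perceptual or semantic fingerprint), the delta is built with the standard library's `difflib.SequenceMatcher`, and none of the function names come from Dits.

```python
import difflib


def fingerprint(chunk: bytes, k: int = 4) -> frozenset[bytes]:
    """Toy fingerprint: the set of all k-byte windows in the chunk."""
    return frozenset(chunk[i:i + k] for i in range(max(1, len(chunk) - k + 1)))


def similarity(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity between two fingerprints."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def nearest(chunk: bytes, stored: dict[str, bytes]):
    """Return the key of the most similar stored chunk, or None if empty."""
    fp = fingerprint(chunk)
    return max(
        stored,
        key=lambda key: similarity(fp, fingerprint(stored[key])),
        default=None,
    )


def make_delta(base: bytes, target: bytes) -> list[tuple]:
    """Describe target as copies from base plus literal inserted bytes."""
    ops = []
    matcher = difflib.SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))
        elif j2 > j1:  # "replace" or "insert": keep the new bytes literally
            ops.append(("insert", target[j1:j2]))
        # "delete" contributes nothing to the target
    return ops


def apply_delta(base: bytes, ops: list[tuple]) -> bytes:
    """Rebuild the target from the base chunk and the delta."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            out += base[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)
```

A store built this way would keep the base chunk's key and the list of operations instead of the full near-duplicate. This sketch scans every stored chunk and diffs byte by byte, which is only workable at toy scale; a real system would need an index for the nearest-match lookup.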

L3 — Derived (store the recipe, recompute the artifact)

Some artifacts do not need to be stored at all. If a file is the deterministic output of a known process (a quantization pass, a tokenization, a format conversion), L3 stores the recipe instead of the bytes and recomputes the artifact on demand. A derivable gigabyte costs zero bytes plus a recipe.

Status: roadmap. L3 is not shipping today.
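As a rough sketch of what "store the recipe" could look like, the example below records an operation name, its parameters, the address of the input, and a digest of the expected output, then recomputes the artifact and checks it against that digest. The recipe format, the `OPS` registry, and the toy `quantize` transform are all hypothetical, and BLAKE2b again stands in for BLAKE3.

```python
import hashlib
import json


def digest(data: bytes) -> str:
    # Stand-in hash: BLAKE2b from the standard library, NOT BLAKE3.
    return hashlib.blake2b(data, digest_size=32).hexdigest()


def quantize(data: bytes, levels: int) -> bytes:
    """Toy deterministic transform: snap each byte down to one of `levels` steps."""
    step = 256 // levels
    return bytes((b // step) * step for b in data)


# Hypothetical registry of deterministic operations a recipe may name.
OPS = {"quantize": quantize}


def make_recipe(op: str, input_key: str, params: dict, output: bytes) -> str:
    """Serialize what is needed to rebuild `output` without storing it."""
    return json.dumps(
        {
            "op": op,
            "input": input_key,
            "params": params,
            "output_digest": digest(output),
        },
        sort_keys=True,
    )


def materialize(recipe_json: str, store: dict[str, bytes]) -> bytes:
    """Recompute the artifact from its recipe and verify the result."""
    recipe = json.loads(recipe_json)
    source = store[recipe["input"]]
    output = OPS[recipe["op"]](source, **recipe["params"])
    if digest(output) != recipe["output_digest"]:
        raise ValueError("recomputed artifact does not match the recorded digest")
    return output
```

The recipe is a few hundred bytes regardless of how large the derived artifact is. The recorded output digest is a design choice in this sketch: it lets a reader detect an operation that turned out not to be deterministic, rather than silently returning different bytes.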

How the layers stack

  • L1 (today) — identical bytes free. The honest floor you can rely on right now.
  • L2 (roadmap) — near-identical bytes nearly free, via fingerprint plus delta.
  • L3 (roadmap) — derivable bytes free entirely, via stored recipe.

Each layer is a strictly larger net. L1 is the only one shipping today, and it already changes the economics of storing model and dataset lineage. L2 and L3 are where the addressing model is headed.