Version control for models, datasets & research data
Model checkpoints, training sets, genomic reads, simulation output—the heaviest, least version-controlled data in AI and science. Dits content-addresses every chunk, so you store only what changed, move deltas instead of whole files, and keep an honest history.
Why does this exist?
Because the biggest files in AI and science are still versioned by renaming them. Here's the whole story in three steps.
Address data three ways.
Not just by the hash of its bytes.
L1 ships today on the open engine. L2 and L3 are the roadmap that takes Dits past exact-match dedup—the ceiling every other tool is built on. See the honest breakdown →
Diff your checkpoints.
Not just store more of them.
See what each step actually changed
Dits splits artifacts into content-defined chunks. Between a base model and a fine-tune—or across shared dataset shards—only the changed chunks are stored. Track lineage, compare runs, and understand your storage at a glance.
- BLAKE3-verified, byte-exact reconstruction
- Dedup across variants, shards, and runs
- Full commit history over multi-GB artifacts
Move less data.
Across nodes, regions, and registries.
→ 26 GB logical (1 artifact)
→ 7 new chunks identified
→ Uploading 4.5 GB (83% deduplicated)
✓ Pushed delta in 11s
Delta sync, not full re-upload
Shuttling checkpoints between training nodes, registries, and regions is pure bandwidth tax when 90% of the bytes already exist on the other side. Because every chunk is content-addressed, sync transfers only the difference—and resumes where it dropped. (The networked sync engine is in active development; the content-addressed store it builds on works today.)
Full re-upload
26 GB
With Dits
4.5 GB
How it compares
“Roadmap” marks capabilities in active design, not shipped today.
Early, and honest about it
3 of 7 phases complete — the content engine works today; tensor-aware dedup, sync, and recompute are next.
Bring version control to your heaviest data
One open engine for the heaviest data in AI and science. Self-hostable, content-addressed, and built to go past exact-match dedup.