An alternative to Git LFS · Xet · DVC

Version control for models, datasets & research data

Model checkpoints, training sets, genomic reads, simulation output—the heaviest, least version-controlled data in AI and science. Dits content-addresses every chunk, so you store only what changed, move deltas instead of whole files, and keep an honest history.

Open core · Self-hostable · Weights never leave your infra

Why does this exist?

Because the biggest files in AI and science are still versioned by renaming them. Here's the whole story in three steps.

The problem
ckpt-step-1000.safetensors
ckpt-step-1500.safetensors
ckpt-best.safetensors
model-final-v3.safetensors

Every checkpoint is another full multi-gigabyte copy in a bucket. No diff, no lineage, no answer to “which weights actually shipped?”

Why buckets can't help

S3 and Git LFS store a whole new copy on every write. They have no idea that two checkpoints, a base model and its fine-tune, or two dataset snapshots are 99% the same.

So you pay full storage and full bandwidth for changes that are a fraction of the bytes.

What Dits does

Content-addresses every chunk. Shared structure across checkpoints, variants, and shards is stored once—and you get real commits, diffs, and lineage over it.

Then it goes further than exact dedup: similarity and recipes.

“Git lets developers version code and only save what changed. Dits does that for weights, datasets, and scientific data.”

Same engine as Dits for media. Different heaviest files.

Address data three ways.

Not just by the hash of its bytes.

Working today
L1 · Exact

Content-addressed bytes

Every chunk is addressed by its BLAKE3 hash. Identical chunks across checkpoints, shards, and variants are stored exactly once, and every read is integrity-verified.
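
To make the idea concrete, here is a minimal sketch of a content-addressed chunk store. It is not the Dits engine: the `ChunkStore` class and its methods are hypothetical names, and it uses BLAKE2b from Python's standard library as a stand-in for BLAKE3 (the `blake3` package is a separate install). It only illustrates the two properties described above: identical chunks are stored once, and every read re-checks the hash.

```python
import hashlib


class ChunkStore:
    """Toy in-memory content-addressed store (illustrative only)."""

    def __init__(self):
        self._chunks = {}  # hex digest -> chunk bytes

    @staticmethod
    def address(data: bytes) -> str:
        # Assumption: BLAKE2b stands in for BLAKE3 so the sketch needs no extra packages.
        return hashlib.blake2b(data, digest_size=32).hexdigest()

    def put(self, data: bytes) -> str:
        key = self.address(data)
        # Identical bytes map to the same key, so a second put stores nothing new.
        self._chunks.setdefault(key, data)
        return key

    def get(self, key: str) -> bytes:
        data = self._chunks[key]
        # Integrity check on read: the bytes must still hash to their address.
        if self.address(data) != key:
            raise ValueError("chunk failed integrity check")
        return data

    def __len__(self) -> int:
        return len(self._chunks)


store = ChunkStore()
a = store.put(b"layer-0 weights")
b = store.put(b"layer-0 weights")  # same bytes, e.g. from a second checkpoint
c = store.put(b"layer-1 weights")
print(a == b, len(store))
```

Because the key is derived from the bytes themselves, deduplication needs no bookkeeping beyond the lookup table.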

Roadmap
L2 · Similar

Similarity-addressed

Near-duplicates — augmented images, re-encodes, lightly-edited samples — addressed by a perceptual/semantic fingerprint and stored as a small delta against the closest existing chunk. The thing exact-match hashing misses entirely.
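
Since this layer is on the roadmap, the following is only a toy illustration of the general shape, not a description of how Dits will do it. It assumes a crude byte-shingle fingerprint with Jaccard similarity in place of a real perceptual or semantic fingerprint, and a copy/literal delta built with Python's `difflib`. All function names are hypothetical.

```python
from difflib import SequenceMatcher


def fingerprint(data: bytes, k: int = 4) -> frozenset:
    """Set of k-byte shingles: a crude stand-in for a perceptual fingerprint."""
    return frozenset(data[i:i + k] for i in range(max(1, len(data) - k + 1)))


def jaccard(a: frozenset, b: frozenset) -> float:
    return len(a & b) / len(a | b) if (a or b) else 1.0


def closest(target: bytes, candidates: dict) -> str:
    """Name of the stored item whose fingerprint is most similar to target."""
    fp = fingerprint(target)
    return max(candidates, key=lambda name: jaccard(fp, fingerprint(candidates[name])))


def make_delta(base: bytes, target: bytes) -> list:
    """Encode target as copies from base plus literal bytes."""
    ops = []
    matcher = SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))
        elif j2 > j1:  # "replace" or "insert": carry the new bytes literally
            ops.append(("data", target[j1:j2]))
        # "delete": nothing to emit, the base bytes are simply skipped
    return ops


def apply_delta(base: bytes, ops: list) -> bytes:
    out = bytearray()
    for op in ops:
        out += base[op[1]:op[2]] if op[0] == "copy" else op[1]
    return bytes(out)


existing = {
    "sample-a": b"the quick brown fox jumps over the lazy dog",
    "sample-b": b"completely unrelated payload of bytes here",
}
edited = b"the quick brown cat jumps over the lazy dog"

base_name = closest(edited, existing)
delta = make_delta(existing[base_name], edited)
literal_bytes = sum(len(op[1]) for op in delta if op[0] == "data")
print(base_name, literal_bytes, "literal bytes out of", len(edited))
```

An exact-match hash sees `edited` as entirely new data; the delta against its nearest neighbour carries only the bytes that differ.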

Roadmap
L3 · Derived

Derivation-addressed

A quantized model, a LoRA, or a checkpoint that is reproducible from (data refs + config + seed) is stored as its recipe — a few hundred bytes — and recomputed on demand instead of stored as terabytes.
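
As a rough sketch of what "stored as its recipe" could mean (this layer is roadmap, so every name here is an assumption): the recipe is a small JSON document naming its inputs, config, and seed; its address is the hash of a canonical serialization; and the artifact is rebuilt by a deterministic function. The toy `materialize` step below, a seeded dither followed by quantization, is purely illustrative, and BLAKE2b again stands in for BLAKE3.

```python
import hashlib
import json
import random


def recipe_address(recipe: dict) -> str:
    """Hash of a canonical JSON form, so key order does not change the address."""
    canonical = json.dumps(recipe, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.blake2b(canonical, digest_size=32).hexdigest()


def materialize(recipe: dict, sources: dict) -> bytes:
    """Recompute the derived artifact from (data refs + config + seed).

    Hypothetical transform: add seeded dither, then quantize each byte.
    """
    src = sources[recipe["inputs"][0]]
    step = 256 // recipe["config"]["levels"]
    rng = random.Random(recipe["seed"])  # same seed -> same dither sequence
    out = bytearray()
    for byte in src:
        dithered = min(255, byte + rng.randrange(step))
        out.append(dithered // step * step)
    return bytes(out)


sources = {"base-weights": bytes(range(256)) * 4}  # stand-in for a stored artifact
recipe = {
    "op": "quantize",
    "inputs": ["base-weights"],
    "config": {"levels": 16},
    "seed": 7,
}

first = materialize(recipe, sources)
second = materialize(recipe, sources)
print(len(json.dumps(recipe)), "recipe bytes ->", len(first), "artifact bytes")
```

The scheme only works when the derivation is fully deterministic: any hidden source of randomness or environment drift would make the recomputed bytes differ from the original.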

L1 ships today on the open engine. L2 and L3 are the roadmap that takes Dits past exact-match dedup—the ceiling every other tool is built on. See the honest breakdown →

Diff your checkpoints.

Not just store more of them.

See what each step actually changed

Dits splits artifacts into content-defined chunks. Between a base model and a fine-tune—or across shared dataset shards—only the changed chunks are stored. Track lineage, compare runs, and understand your storage at a glance.

  • BLAKE3-verified, byte-exact reconstruction
  • Dedup across variants, shards, and runs
  • Full commit history over multi-GB artifacts
llama-ft-step-2000.safetensors · 26 GB
33 chunks reused · 7 new chunks (4.5 GB)
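
The following sketch shows why content-defined chunking keeps an edit local. It is an assumption-laden toy, not the Dits chunker: it uses a simple gear-style rolling hash with a fixed random table, no minimum or maximum chunk size, and BLAKE2b in place of BLAKE3. Because each cut point depends only on the preceding 64 bytes, inserting data in the middle of a file changes the chunks around the edit and leaves the rest byte-identical.

```python
import hashlib
import random

_table_rng = random.Random(42)
GEAR = [_table_rng.getrandbits(64) for _ in range(256)]  # fixed per-byte values
MASK64 = (1 << 64) - 1


def chunk(data: bytes, bits: int = 12) -> list:
    """Cut after any byte where the top `bits` bits of the rolling hash are zero.

    With bits=12 the expected chunk size is about 4 KiB.
    """
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        # Older bytes shift out after 64 steps, so h depends on the last 64 bytes only.
        h = ((h << 1) + GEAR[byte]) & MASK64
        if h >> (64 - bits) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def digest(chunk_bytes: bytes) -> str:
    return hashlib.blake2b(chunk_bytes, digest_size=32).hexdigest()


data_rng = random.Random(0)
base = bytes(data_rng.getrandbits(8) for _ in range(200_000))
edited = base[:100_000] + b"a small edit" + base[100_000:]  # insert mid-file

stored = {digest(c) for c in chunk(base)}
edited_chunks = chunk(edited)
new = [c for c in edited_chunks if digest(c) not in stored]
print(f"{len(edited_chunks) - len(new)} chunks reused, {len(new)} new")
```

Fixed-size chunking would fare much worse here: every block after the insertion point would shift by 12 bytes and hash differently.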

Move less data.

Across nodes, regions, and registries.

terminal
$ dits push registry main
Analyzing checkpoint...
→ 26 GB logical (1 artifact)
→ 7 new chunks identified
→ Uploading 4.5 GB (83% deduplicated)

✓ Pushed delta in 11s
Sync engine on the roadmap

Delta sync, not full re-upload

Shuttling checkpoints between training nodes, registries, and regions is pure bandwidth tax when 90% of the bytes already exist on the other side. Because every chunk is content-addressed, sync transfers only the difference—and resumes where it dropped. (The networked sync engine is in active development; the content-addressed store it builds on works today.)
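
A minimal sketch of the planning step behind such a sync, assuming each side can list the chunk digests it holds. The networked engine is still in development, so `plan_push` and the manifest shape are hypothetical. The chunk sizes are made up and chosen only so the totals match the 26 GB / 4.5 GB example above; the digests are placeholder labels.

```python
def plan_push(local_manifest: dict, remote_digests: set) -> dict:
    """local_manifest maps chunk digest -> size in MB; returns what to upload."""
    missing = {d: size for d, size in local_manifest.items() if d not in remote_digests}
    logical = sum(local_manifest.values())
    upload = sum(missing.values())
    return {
        "missing": sorted(missing),
        "logical_mb": logical,
        "upload_mb": upload,
        "dedup_pct": round(100 * (1 - upload / logical)),
    }


# Made-up sizes: 33 chunks the remote already has (21,500 MB) ...
reused = {f"reused-{i:02d}": 650 for i in range(32)}
reused["reused-32"] = 700
# ... and 7 new chunks (4,500 MB).
new = {f"new-{i}": size for i, size in enumerate([700, 700, 700, 600, 600, 600, 600])}
local = {**reused, **new}

plan = plan_push(local, set(reused))
print(plan["logical_mb"], plan["upload_mb"], plan["dedup_pct"])

# Resuming after a dropped connection is the same computation again:
# chunks that already arrived are no longer missing.
resumed = plan_push(local, set(reused) | {"new-0", "new-1", "new-2"})
print(len(resumed["missing"]), resumed["upload_mb"])
```

The deduplication figure follows directly from the two totals: 1 - 4.5 / 26 ≈ 0.827, which rounds to 83%.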

Full re-upload

26 GB

With Dits

4.5 GB

How it compares

Capability · Git LFS · Xet / DVC · Dits
Stores only changed chunks
Dedupe across checkpoints & variants
Integrity-verified reads (BLAKE3) · partial · partial
Similarity dedup (near-duplicates) · roadmap
Recipe / recompute instead of store · roadmap
Open core, self-hostable · partial

“Roadmap” marks capabilities in active design, not shipped today.


Early, and honest about it

3 of 7 phases complete — the content engine works today; tensor-aware dedup, sync, and recompute are next.

Content store · Exact dedup · Local history · Tensor-aware chunking · Network sync · Similarity dedup · Derivation / recompute
Follow on GitHub

Bring version control to your heaviest data

One open engine for the heaviest data in AI and science. Self-hostable, content-addressed, and built to go past exact-match dedup.