Content Addressing

Dits identifies all data by its cryptographic hash, ensuring integrity and enabling efficient deduplication across repositories.

What is Content Addressing?

In a content-addressed storage system, data is identified by what it contains rather than where it's located. The "address" of any piece of data is the cryptographic hash of its contents.

Traditional addressing (location-based):
  /projects/video/scene1_take2.mov

Content addressing (hash-based):
  blake3:a1b2c3d4e5f6...  (represents the exact bytes)
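The difference can be sketched in a few lines of Python. This is an illustration only: `content_address` is a hypothetical helper, and BLAKE2b with a 32-byte digest stands in for BLAKE3, which is not part of Python's standard library.

```python
import hashlib


def content_address(data: bytes) -> str:
    """Return a hash-based address for the given bytes.

    Assumption: BLAKE2b-256 is a stand-in for BLAKE3, and the "blake2b:"
    prefix is illustrative rather than Dits's actual address format.
    """
    return "blake2b:" + hashlib.blake2b(data, digest_size=32).hexdigest()


# The address depends only on the bytes, never on a path or file name.
take_a = content_address(b"frame data")
take_b = content_address(b"frame data")
edited = content_address(b"frame data!")

print(take_a == take_b)  # True
print(take_a == edited)  # False
```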

Benefits of Content Addressing

Data Integrity
If the hash matches, the data is exactly what was stored (barring a hash collision, which is computationally infeasible for BLAKE3). Any corruption or tampering is detected the next time the hash is verified.
Automatic Deduplication
Identical content always has the same hash, so duplicates are automatically eliminated without any special handling.
Efficient Distribution
Content can be retrieved from any source that has it: the hash lets the receiver verify the bytes regardless of where they came from.
Immutability
Content-addressed data is inherently immutable. You can't change the data without changing the address, creating a clear audit trail.
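Deduplication and immutability both fall out of using the hash as the storage key. The sketch below is a minimal in-memory store, not Dits's storage layer: `ContentStore` is a hypothetical class, and BLAKE2b-256 again stands in for BLAKE3.

```python
import hashlib


class ContentStore:
    """In-memory content-addressed store (illustrative only)."""

    def __init__(self):
        self._objects = {}  # hex digest -> bytes

    def put(self, data: bytes) -> str:
        # Assumption: BLAKE2b-256 is a stand-in for BLAKE3.
        key = hashlib.blake2b(data, digest_size=32).hexdigest()
        # Storing identical bytes again is a no-op: same key, same value.
        self._objects.setdefault(key, data)
        return key

    def get(self, key: str) -> bytes:
        return self._objects[key]

    def __len__(self) -> int:
        return len(self._objects)


store = ContentStore()
first = store.put(b"scene 1, take 2")
second = store.put(b"scene 1, take 2")  # duplicate content
third = store.put(b"scene 1, take 3")   # changed content gets a new address

print(first == second, first == third, len(store))  # True False 2
```

Because an existing key is never overwritten, "editing" an object can only ever add a new object under a new address.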

BLAKE3: The Hash Function

Dits uses BLAKE3 for all content addressing. BLAKE3 is a modern cryptographic hash function designed for speed and security.

Why BLAKE3?

  • Extremely fast: many times faster than SHA-256 in software (about 15x in the comparison below), essential for hashing large video files
  • Parallelizable: can use all CPU cores for maximum throughput
  • Secure: builds on the well-analyzed BLAKE2 design, with no known practical attacks
  • Fixed-size addresses: the default digest is always 256 bits (32 bytes), regardless of input size

Performance comparison (10GB file; approximate, actual numbers depend on hardware):

SHA-256:   ~45 seconds
SHA-1:     ~30 seconds
BLAKE2b:   ~15 seconds
BLAKE3:    ~3 seconds  ← Dits uses this

BLAKE3 throughput: ~3 GB/s per core (multi-threaded: 10+ GB/s)
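Figures like these vary with CPU and build options, so it can help to measure on your own machine. The sketch below times the hash functions available in Python's `hashlib`; BLAKE3 itself is omitted because it is not in the standard library, and `measure_throughput` is a hypothetical helper, not a Dits tool.

```python
import hashlib
import time


def measure_throughput(algorithms, size_bytes=8 * 1024 * 1024):
    """Return {name: MB/s} for hashing one in-memory buffer per algorithm."""
    buffer = bytes(size_bytes)  # zero-filled test data
    results = {}
    for name in algorithms:
        start = time.perf_counter()
        hashlib.new(name, buffer).digest()
        # Guard against a zero reading on very coarse timers.
        elapsed = max(time.perf_counter() - start, 1e-9)
        results[name] = size_bytes / elapsed / 1e6
    return results


for name, mb_per_s in measure_throughput(["sha256", "sha1", "blake2b"]).items():
    print(f"{name:8s} {mb_per_s:8.0f} MB/s")
```

A single pass over a small buffer is noisy; repeating the run and taking the best result gives steadier numbers.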

Content-Addressed Objects

Dits uses content addressing for all objects in the repository:

Chunks

The smallest unit of storage. Each chunk's address is the BLAKE3 hash of its raw bytes.

Chunk {
  hash: blake3("raw bytes of chunk data"),
  size: 1048576,  // 1 MiB
  data: [...raw bytes...]
}
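As a rough illustration of the structure above, the following sketch splits a byte string into chunks and addresses each one by its hash. It assumes fixed-size 1 MiB chunks purely for simplicity (this page does not state how Dits picks chunk boundaries), and it uses BLAKE2b-256 as a stand-in for BLAKE3.

```python
import hashlib
from typing import Iterator, NamedTuple

CHUNK_SIZE = 1_048_576  # 1 MiB, matching the size in the example above


class Chunk(NamedTuple):
    hash: str
    size: int
    data: bytes


def split_into_chunks(data: bytes, chunk_size: int = CHUNK_SIZE) -> Iterator[Chunk]:
    """Yield fixed-size chunks (an assumption), each addressed by its hash."""
    for start in range(0, len(data), chunk_size):
        piece = data[start:start + chunk_size]
        # Assumption: BLAKE2b-256 is a stand-in for BLAKE3.
        digest = hashlib.blake2b(piece, digest_size=32).hexdigest()
        yield Chunk(hash=digest, size=len(piece), data=piece)


chunks = list(split_into_chunks(bytes(2_500_000)))
print([c.size for c in chunks])           # [1048576, 1048576, 402848]
print(chunks[0].hash == chunks[1].hash)   # True: identical bytes, same address
```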

Assets

An asset represents a file and contains an ordered list of chunk references. The asset's address is the hash of its metadata and chunk list.

Asset {
  hash: blake3(metadata + chunk_list),
  size: 10737418240,  // 10 GiB
  chunks: [
    { hash: "a1b2c3...", offset: 0 },
    { hash: "d4e5f6...", offset: 1048576 },
    // ... more chunks
  ],
  metadata: {
    mime_type: "video/mp4",
    duration: 300.5,
    // ... codec info
  }
}
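Hashing a structured object requires a deterministic byte encoding; otherwise the same asset could produce different hashes. The sketch below uses canonical JSON (sorted keys, no extra whitespace) as one possible encoding. The encoding and the `asset_hash` helper are assumptions for illustration, not Dits's actual format, and BLAKE2b-256 stands in for BLAKE3.

```python
import hashlib
import json


def asset_hash(metadata: dict, chunks: list) -> str:
    """Hash an asset's metadata and ordered chunk list.

    Assumption: canonical JSON is the serialization, and BLAKE2b-256 is a
    stand-in for BLAKE3.
    """
    canonical = json.dumps(
        {"metadata": metadata, "chunks": chunks},
        sort_keys=True,          # key order must not affect the hash
        separators=(",", ":"),   # no insignificant whitespace
    ).encode("utf-8")
    return hashlib.blake2b(canonical, digest_size=32).hexdigest()


chunk_list = [
    {"hash": "a1b2c3...", "offset": 0},
    {"hash": "d4e5f6...", "offset": 1048576},
]
a = asset_hash({"mime_type": "video/mp4", "duration": 300.5}, chunk_list)
b = asset_hash({"duration": 300.5, "mime_type": "video/mp4"}, chunk_list)
c = asset_hash({"mime_type": "video/mp4", "duration": 300.5}, chunk_list[::-1])

print(a == b)  # True: metadata key order is irrelevant
print(a == c)  # False: chunk order is part of the asset's identity
```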

Manifests (Trees)

A manifest maps file paths to assets, representing a directory structure at a point in time.

Manifest {
  hash: blake3(sorted_entries),
  entries: {
    "footage/scene1.mov": { asset: "abc123...", mode: 0o644 },
    "footage/scene2.mov": { asset: "def456...", mode: 0o644 },
    "project.prproj":     { asset: "789abc...", mode: 0o644 },
  }
}
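Sorting the entries before hashing makes the manifest hash independent of the order in which files were added. A minimal sketch follows, assuming a hypothetical line format of `mode asset path` terminated by a NUL byte, with BLAKE2b-256 standing in for BLAKE3.

```python
import hashlib


def manifest_hash(entries: dict) -> str:
    """Hash manifest entries in sorted path order.

    Assumption: the "mode asset path\\0" record format is illustrative, and
    BLAKE2b-256 is a stand-in for BLAKE3.
    """
    hasher = hashlib.blake2b(digest_size=32)
    for path in sorted(entries):
        entry = entries[path]
        record = f"{entry['mode']:o} {entry['asset']} {path}\0"
        hasher.update(record.encode("utf-8"))
    return hasher.hexdigest()


forward = {
    "footage/scene1.mov": {"asset": "abc123...", "mode": 0o644},
    "footage/scene2.mov": {"asset": "def456...", "mode": 0o644},
}
# Same entries, inserted in the opposite order.
backward = dict(reversed(list(forward.items())))

print(manifest_hash(forward) == manifest_hash(backward))  # True
```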

Commits

A commit references a manifest (tree) and parent commits, creating the version history.

Commit {
  hash: blake3(all_fields),
  tree: "manifest_hash...",
  parents: ["parent_commit_hash..."],
  author: "Jane Editor <jane@example.com>",
  timestamp: "2024-01-15T10:30:00Z",
  message: "Add color grading to scene 1"
}
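Because each commit hash covers its parents' hashes, changing any earlier commit changes every hash after it. The sketch below demonstrates that chaining. The canonical-JSON encoding and the `commit_hash` helper are assumptions for illustration, and BLAKE2b-256 stands in for BLAKE3.

```python
import hashlib
import json


def commit_hash(tree, parents, author, timestamp, message) -> str:
    """Hash all commit fields (assumed encoding: canonical JSON)."""
    fields = {
        "tree": tree,
        "parents": parents,
        "author": author,
        "timestamp": timestamp,
        "message": message,
    }
    payload = json.dumps(fields, sort_keys=True, separators=(",", ":")).encode("utf-8")
    # Assumption: BLAKE2b-256 is a stand-in for BLAKE3.
    return hashlib.blake2b(payload, digest_size=32).hexdigest()


author = "Jane Editor <jane@example.com>"
root = commit_hash("manifest_a", [], author, "2024-01-14T09:00:00Z", "Import footage")
child = commit_hash("manifest_b", [root], author, "2024-01-15T10:30:00Z",
                    "Add color grading to scene 1")

# Rewriting the root's message yields a different root hash, which in turn
# yields a different hash for the otherwise identical child commit.
forged_root = commit_hash("manifest_a", [], author, "2024-01-14T09:00:00Z", "Import")
forged_child = commit_hash("manifest_b", [forged_root], author, "2024-01-15T10:30:00Z",
                           "Add color grading to scene 1")

print(root != forged_root, child != forged_child)  # True True
```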

Hash Verification

Dits verifies hashes at multiple points to ensure data integrity:

  1. On write: When storing a chunk, the hash is computed and becomes the storage key
  2. On read: After reading a chunk, the hash is verified to match the expected value
  3. On transfer: During push/pull, hashes are verified to ensure data wasn't corrupted in transit
  4. On demand: The dits fsck command verifies all stored data
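The read-time and on-demand checks can be sketched as follows. This is an illustration over a plain dictionary, not the implementation of `dits fsck`; `read_verified` and `fsck` are hypothetical helpers, and BLAKE2b-256 stands in for BLAKE3.

```python
import hashlib


def digest(data: bytes) -> str:
    # Assumption: BLAKE2b-256 is a stand-in for BLAKE3.
    return hashlib.blake2b(data, digest_size=32).hexdigest()


def read_verified(store: dict, expected: str) -> bytes:
    """Return the object stored under `expected`, or raise if it is corrupt."""
    data = store[expected]
    if digest(data) != expected:
        raise ValueError(f"hash mismatch for object {expected[:12]}")
    return data


def fsck(store: dict) -> list:
    """Return the keys of all objects whose bytes no longer match their key."""
    return [key for key, data in store.items() if digest(data) != key]


good = b"chunk one"
store = {
    digest(good): good,
    digest(b"chunk two"): b"chunk tw0",  # simulated bit rot
}

print(len(fsck(store)))  # 1
```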

Content Addressing in Practice

Finding Duplicates

$ dits add footage/take1.mov footage/take2.mov

Chunking footage/take1.mov... 10,234 chunks
Chunking footage/take2.mov... 10,198 chunks
  → 8,547 chunks already exist (83% deduplicated)
  → 1,651 new chunks to store

Storage: 1.6 GB instead of 20 GB
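The chunk counts reported for take2.mov come down to set membership on chunk hashes. The sketch below reproduces those counts with made-up identifiers in place of real hashes; `dedup_stats` is a hypothetical helper, not part of the Dits CLI.

```python
def dedup_stats(existing, incoming):
    """Count incoming chunk hashes that are already stored versus new."""
    seen = set(existing)
    reused = new = 0
    for chunk_hash in incoming:
        if chunk_hash in seen:
            reused += 1
        else:
            new += 1
            seen.add(chunk_hash)  # later repeats within the same file dedupe too
    return reused, new


# Made-up identifiers standing in for real chunk hashes.
take1 = [f"shared-{i}" for i in range(10_234)]
take2 = [f"shared-{i}" for i in range(8_547)] + [f"unique-{i}" for i in range(1_651)]

reused, new = dedup_stats(take1, take2)
print(reused, new, f"{reused / len(take2):.1%}")  # 8547 1651 83.8%
```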

Verifying Integrity

$ dits fsck

Checking 45,892 chunks...
Checking 1,234 assets...
Checking 89 commits...

All objects verified. No corruption detected.

Referencing Specific Content

# Check out a specific version of a file
$ dits show abc123def:footage/scene1.mov > old_scene1.mov

# The hash guarantees you get exactly what was stored
$ b3sum old_scene1.mov
abc123def456...  old_scene1.mov  ✓

Security Considerations

Collision Resistance

BLAKE3 produces 256-bit hashes, so there are 2^256 possible hash values. Finding two different inputs with the same hash (a collision) takes on the order of 2^128 hash evaluations by the birthday bound, and the chance of an accidental collision among any realistic number of chunks is negligible.
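To put a number on "astronomically small", the union bound says that among n objects hashed by an ideal b-bit function, the collision probability is at most n(n-1)/2 divided by 2^b. The sketch below evaluates that bound; the figure of 10^18 stored chunks is an arbitrary, deliberately generous assumption.

```python
from fractions import Fraction


def collision_probability_bound(num_objects: int, hash_bits: int = 256) -> Fraction:
    """Upper bound on the chance that any two of num_objects hashes collide.

    Uses the union bound over all pairs, assuming an ideal hash function.
    """
    pairs = num_objects * (num_objects - 1) // 2
    return Fraction(pairs, 2 ** hash_bits)


# 10**18 chunks is an arbitrary upper estimate, far beyond any real repository.
bound = collision_probability_bound(10**18)
print(f"{float(bound):.1e}")
```

Even at that scale the bound stays below 1e-40, far smaller than the chance of an undetected hardware fault.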

Pre-image Resistance

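Given a hash, it's computationally infeasible to find any data that produces that hash (pre-image resistance), or to find different data that matches the hash of a known file (second pre-image resistance). This protects against attacks where someone tries to substitute malicious content for legitimate content under the same address.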

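Not Encryption

Hashing is not encryption. A hash identifies content and lets you verify it, but it does not hide the content: content addressing provides integrity, not confidentiality.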

Next Steps