Content Addressing
Dits identifies all data by its cryptographic hash, ensuring integrity and enabling efficient deduplication across repositories.
What is Content Addressing?
In a content-addressed storage system, data is identified by what it contains rather than where it's located. The "address" of any piece of data is the cryptographic hash of its contents.
Traditional addressing (location-based):
/projects/video/scene1_take2.mov
Content addressing (hash-based):
blake3:a1b2c3d4e5f6... (represents the exact bytes)
Benefits of Content Addressing
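Both benefits named in this page's introduction, integrity and deduplication, follow from using the hash as the storage key. The sketch below illustrates this with a hypothetical in-memory ContentStore class (not part of Dits). It assumes BLAKE2b truncated to 256 bits as a stand-in hash, because BLAKE3 is not in Python's standard library.

```python
import hashlib


def address_of(data: bytes) -> str:
    """Return a content address for data.

    Stand-in: BLAKE2b with a 32-byte digest, because BLAKE3 is not in the
    Python standard library. The "blake2b:" prefix makes the swap explicit.
    """
    return "blake2b:" + hashlib.blake2b(data, digest_size=32).hexdigest()


class ContentStore:
    """Hypothetical in-memory store keyed by content address."""

    def __init__(self):
        self._objects = {}

    def put(self, data: bytes) -> str:
        addr = address_of(data)
        # Identical bytes map to the same key, so they are stored only once.
        self._objects.setdefault(addr, data)
        return addr

    def get(self, addr: str) -> bytes:
        return self._objects[addr]

    def __len__(self) -> int:
        return len(self._objects)


store = ContentStore()
a = store.put(b"frame data")
b = store.put(b"frame data")      # same bytes, e.g. under a different file name
c = store.put(b"frame data v2")   # different bytes

print(a == b, a == c, len(store))  # True False 2
```

The file name never enters the address, so renaming or copying a file adds no new stored data.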
BLAKE3: The Hash Function
Dits uses BLAKE3 for all content addressing. BLAKE3 is a modern cryptographic hash function designed for speed and security.
Why BLAKE3?
- Extremely fast: 3-10x faster than SHA-256, essential for hashing large video files
- Parallelizable: Can utilize all CPU cores for maximum throughput
- Secure: Based on the proven BLAKE2 design, with no known vulnerabilities
- Fixed-size addresses: The default output is a 256-bit (32-byte) hash, regardless of input size
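Large video files are usually hashed incrementally rather than loaded into memory at once. The sketch below shows that pattern and the fixed digest length. The function name hash_stream and the 1 MiB block size are illustrative assumptions, and BLAKE2b with a 32-byte digest stands in for BLAKE3, which is not in Python's standard library.

```python
import hashlib
import io


def hash_stream(stream, block_size: int = 1024 * 1024) -> bytes:
    """Hash a binary stream incrementally, one block at a time.

    Stand-in: hashlib.blake2b with a 32-byte digest (BLAKE3 is not in the
    standard library). Only one block is held in memory at a time.
    """
    hasher = hashlib.blake2b(digest_size=32)
    while True:
        block = stream.read(block_size)
        if not block:
            break
        hasher.update(block)
    return hasher.digest()


small = hash_stream(io.BytesIO(b"x"))
large = hash_stream(io.BytesIO(b"x" * (3 * 1024 * 1024 + 17)))

# The digest length is 32 bytes for both the 1-byte and the ~3 MiB input.
print(len(small), len(large))  # 32 32
```

With a real file, the same function would be called on an object opened in binary mode.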
Performance comparison (10GB file):
SHA-256: ~45 seconds
SHA-1: ~30 seconds
BLAKE2b: ~15 seconds
BLAKE3: ~3 seconds ← Dits uses this
BLAKE3 throughput: 3 GB/s per core (multi-threaded: 10+ GB/s)
Content-Addressed Objects
Dits uses content addressing for all objects in the repository:
Chunks
The smallest unit of storage. Each chunk's address is the BLAKE3 hash of its raw bytes.
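As a rough illustration, the sketch below splits a buffer into chunks and addresses each one by the hash of its raw bytes. It assumes fixed-size splitting for simplicity (Dits' actual chunking strategy may differ). The helper names are hypothetical, and BLAKE2b with a 32-byte digest stands in for BLAKE3.

```python
import hashlib

CHUNK_SIZE = 1024 * 1024  # 1 MiB, the chunk size used in this page's example


def chunk_hash(raw: bytes) -> str:
    # Stand-in for blake3(raw): BLAKE2b with a 32-byte digest.
    return hashlib.blake2b(raw, digest_size=32).hexdigest()


def split_chunks(data: bytes, size: int = CHUNK_SIZE):
    """Yield (hash, chunk_bytes) pairs using fixed-size splitting (an assumption)."""
    for start in range(0, len(data), size):
        raw = data[start:start + size]
        yield chunk_hash(raw), raw


data = b"A" * CHUNK_SIZE + b"B" * CHUNK_SIZE + b"A" * CHUNK_SIZE
hashes = [h for h, _ in split_chunks(data)]

# The first and third chunks hold identical bytes, so they share one address.
print(len(hashes), len(set(hashes)))  # 3 2
```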
Chunk {
hash: blake3("raw bytes of chunk data"),
size: 1048576, // 1 MiB
data: [...raw bytes...]
}
Assets
An asset represents a file and contains an ordered list of chunk references. The asset's address is the hash of its metadata and chunk list.
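One way to picture this is the sketch below, which builds an asset record from an ordered list of chunk byte strings. The build_asset helper and the JSON serialization are assumptions made for illustration, not Dits' on-disk format, and BLAKE2b with a 32-byte digest stands in for BLAKE3.

```python
import hashlib
import json


def blake_hex(data: bytes) -> str:
    # Stand-in for BLAKE3: BLAKE2b with a 32-byte digest.
    return hashlib.blake2b(data, digest_size=32).hexdigest()


def build_asset(chunks: list, metadata: dict) -> dict:
    """Build an asset record from an ordered list of chunk byte strings."""
    chunk_list = []
    offset = 0
    for raw in chunks:
        chunk_list.append({"hash": blake_hex(raw), "offset": offset})
        offset += len(raw)
    # Assumed canonical form: JSON with sorted keys and no extra whitespace.
    canonical = json.dumps(
        {"metadata": metadata, "chunks": chunk_list},
        sort_keys=True,
        separators=(",", ":"),
    ).encode("utf-8")
    return {
        "hash": blake_hex(canonical),
        "size": offset,
        "chunks": chunk_list,
        "metadata": metadata,
    }


meta = {"mime_type": "video/mp4"}
asset_a = build_asset([b"aaaa", b"bbbb"], meta)
asset_b = build_asset([b"bbbb", b"aaaa"], meta)  # same chunks, different order

# Chunk order is part of the address, so the two assets hash differently.
print(asset_a["size"], asset_a["hash"] != asset_b["hash"])  # 8 True
```

Because the asset stores only chunk references, two assets that share most of their chunks add little to total storage.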
Asset {
hash: blake3(metadata + chunk_list),
size: 10737418240, // 10 GiB
chunks: [
{ hash: "a1b2c3...", offset: 0 },
{ hash: "d4e5f6...", offset: 1048576 },
// ... more chunks
],
metadata: {
mime_type: "video/mp4",
duration: 300.5,
// ... codec info
}
}
Manifests (Trees)
A manifest maps file paths to assets, representing a directory structure at a point in time.
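Hashing the entries in sorted order makes the manifest address independent of the order in which files were added. The sketch below shows this under stated assumptions: the manifest_hash helper and its line-based serialization are hypothetical, and BLAKE2b with a 32-byte digest stands in for BLAKE3.

```python
import hashlib
import json


def manifest_hash(entries: dict) -> str:
    """Hash manifest entries in sorted path order.

    Assumptions: each entry is serialized as a JSON line [path, asset, mode],
    and BLAKE2b (32-byte digest) stands in for BLAKE3.
    """
    hasher = hashlib.blake2b(digest_size=32)
    for path in sorted(entries):
        entry = entries[path]
        line = json.dumps([path, entry["asset"], entry["mode"]])
        hasher.update(line.encode("utf-8") + b"\n")
    return hasher.hexdigest()


first = {
    "footage/scene1.mov": {"asset": "abc123", "mode": 0o644},
    "project.prproj": {"asset": "789abc", "mode": 0o644},
}
# Same entries inserted in the opposite order.
second = dict(reversed(list(first.items())))
# One file now points at a different asset.
changed = {**first, "project.prproj": {"asset": "def456", "mode": 0o644}}

print(manifest_hash(first) == manifest_hash(second))   # True
print(manifest_hash(first) == manifest_hash(changed))  # False
```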
Manifest {
hash: blake3(sorted_entries),
entries: {
"footage/scene1.mov": { asset: "abc123...", mode: 0o644 },
"footage/scene2.mov": { asset: "def456...", mode: 0o644 },
"project.prproj": { asset: "789xyz...", mode: 0o644 },
}
}
Commits
A commit references a manifest (tree) and parent commits, creating the version history.
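Because a commit's hash covers its parent hashes, changing any earlier commit changes the address of every commit built on top of it. The sketch below illustrates that chaining. The commit_hash helper and its JSON serialization are illustrative assumptions, and BLAKE2b with a 32-byte digest stands in for BLAKE3.

```python
import hashlib
import json


def commit_hash(tree: str, parents: list, author: str,
                timestamp: str, message: str) -> str:
    """Hash every commit field, including the parent hashes.

    Assumptions: canonical JSON serialization; BLAKE2b (32-byte digest)
    stands in for BLAKE3.
    """
    payload = json.dumps(
        {"tree": tree, "parents": parents, "author": author,
         "timestamp": timestamp, "message": message},
        sort_keys=True,
    ).encode("utf-8")
    return hashlib.blake2b(payload, digest_size=32).hexdigest()


author = "Jane Editor <jane@example.com>"
root = commit_hash("tree1", [], author, "2024-01-15T10:30:00Z", "Import footage")
child = commit_hash("tree2", [root], author, "2024-01-16T09:00:00Z",
                    "Add color grading to scene 1")

# Rewriting the root (different tree) yields a different root hash, and a
# child built on the rewritten root no longer matches the original child.
root2 = commit_hash("tree1-edited", [], author, "2024-01-15T10:30:00Z",
                    "Import footage")
child2 = commit_hash("tree2", [root2], author, "2024-01-16T09:00:00Z",
                     "Add color grading to scene 1")

print(root != root2, child != child2)  # True True
```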
Commit {
hash: blake3(all_fields),
tree: "manifest_hash...",
parents: ["parent_commit_hash..."],
author: "Jane Editor <jane@example.com>",
timestamp: "2024-01-15T10:30:00Z",
message: "Add color grading to scene 1"
}
Hash Verification
Dits verifies hashes at multiple points to ensure data integrity:
- On write: When storing a chunk, the hash is computed and becomes the storage key
- On read: After reading a chunk, the hash is verified to match the expected value
- On transfer: During push/pull, hashes are verified to ensure data wasn't corrupted in transit
- On demand: The dits fsck command verifies all stored data
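The write, read, and on-demand checks above can be sketched as follows. The function names (write_chunk, read_chunk, fsck) and the dict used as storage are hypothetical and only mirror the steps described here. BLAKE2b with a 32-byte digest stands in for BLAKE3.

```python
import hashlib


class CorruptChunkError(Exception):
    """Raised when stored bytes no longer match their address."""


def blake_hex(data: bytes) -> str:
    # Stand-in for BLAKE3: BLAKE2b with a 32-byte digest.
    return hashlib.blake2b(data, digest_size=32).hexdigest()


def write_chunk(storage: dict, data: bytes) -> str:
    """On write: the computed hash becomes the storage key."""
    key = blake_hex(data)
    storage[key] = data
    return key


def read_chunk(storage: dict, key: str) -> bytes:
    """On read: recompute the hash and compare it with the expected key."""
    data = storage[key]
    if blake_hex(data) != key:
        raise CorruptChunkError(f"chunk {key[:12]} failed verification")
    return data


def fsck(storage: dict) -> list:
    """On demand: return the keys of all chunks that fail verification."""
    return [key for key, data in storage.items() if blake_hex(data) != key]


storage = {}
key = write_chunk(storage, b"good bytes")
assert read_chunk(storage, key) == b"good bytes"

storage[key] = b"flipped bits"  # simulate on-disk corruption
print(fsck(storage) == [key])  # True
```

The same comparison applies during transfer: the receiver recomputes the hash of what arrived and rejects anything that does not match the requested address.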
Automatic Corruption Detection
Content Addressing in Practice
Finding Duplicates
$ dits add footage/take1.mov footage/take2.mov
Chunking footage/take1.mov... 10,234 chunks
Chunking footage/take2.mov... 10,198 chunks
→ 8,547 chunks already exist (83% deduplicated)
→ 1,651 new chunks to store
Storage: 1.6 GB instead of 20 GB
Verifying Integrity
$ dits fsck
Checking 45,892 chunks...
Checking 1,234 assets...
Checking 89 commits...
All objects verified. No corruption detected.
Referencing Specific Content
# Check out a specific version of a file
$ dits show abc123def:footage/scene1.mov > old_scene1.mov
# The hash guarantees you get exactly what was stored
$ b3sum old_scene1.mov
abc123def456... old_scene1.mov ✓
Security Considerations
Collision Resistance
BLAKE3 produces 256-bit hashes, so there are 2^256 possible hash values. The probability of two different pieces of data having the same hash (a collision) is astronomically small: by the birthday bound, on the order of 2^128 distinct inputs would have to be hashed before a collision becomes likely.
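To put a number on the collision risk, the sketch below applies the standard birthday approximation p ≈ n^2 / 2^(bits + 1) and reports the result as a power of two. The function name is illustrative, and the approximation only holds while p is far below 1.

```python
import math


def collision_log2_probability(num_objects: int, hash_bits: int = 256) -> float:
    """Approximate log2 of the chance that any two of num_objects collide.

    Uses the birthday approximation p ~= n^2 / 2^(bits + 1), which is valid
    while p is much smaller than 1.
    """
    return 2 * math.log2(num_objects) - (hash_bits + 1)


# 2^40 chunks of 1 MiB each would be about one exbibyte of unique data.
print(collision_log2_probability(2 ** 40))  # -177.0
```

Even at that scale, the chance of any accidental collision is about 2^-177.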
Pre-image Resistance
Given a hash, it's computationally infeasible to find any data that produces that hash. This protects against attacks where someone tries to create malicious content that appears legitimate.
Not Encryption
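Important: Content addressing provides integrity, not confidentiality. A hash identifies and verifies data, but it does not hide the data itself.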
Next Steps
- Learn about Repositories
- Understand Commits & History
- Explore Chunking & Deduplication