Chunking & Deduplication
Dits uses content-defined chunking (CDC) to break files into variable-size pieces, enabling efficient storage and transfer of large binary files.
Why Chunking Matters
Traditional version control systems like Git treat each file as a single unit. When you modify a 10GB video file, Git stores an entirely new copy, even if you only changed a few seconds of footage.
Dits takes a different approach: it breaks files into smaller chunks (typically 256 KB to 4 MB) and only stores unique chunks. This means that when you change part of a file, only the chunks covering the changed portion need to be stored and transferred; the rest are shared with previous versions.
Content-Defined Chunking (CDC)
Dits implements multiple content-defined chunking algorithms optimized for different use cases. The default is FastCDC, but alternatives are available for specific performance or security requirements.
Available Chunking Algorithms
Alongside the default FastCDC, Dits provides Parallel FastCDC, Keyed FastCDC, Asymmetric Extremum, Rabin, and Chonkers. See Choosing a Chunking Algorithm below for guidance on when to use each.
FastCDC Algorithm Details
FastCDC is Dits' primary chunking algorithm, providing an optimal balance of performance and deduplication effectiveness.
How It Works
- Rolling Hash: A sliding window moves through the file, computing a rolling hash at each position.
- Boundary Detection: When the hash matches a specific pattern (determined by a bitmask), a chunk boundary is created.
- Size Constraints: Chunks are constrained to be between minimum and maximum sizes to ensure consistent behavior.
```
// Dits chunking parameters
Minimum chunk size: 256 KB
Average chunk size: 1 MB
Maximum chunk size: 4 MB

// Example: 10 GB video file
Original file:   10 GB (1 file)
After chunking:  ~10,000 chunks
Average chunk:   ~1 MB each
```
Why Content-Defined?
The key advantage of CDC over fixed-size chunking is shift resistance. Consider what happens when you insert data at the beginning of a file:
- Fixed-size chunking: inserting X at the start shifts every boundary, so every chunk's content changes and all chunks must be stored again.
- Content-defined chunking: inserting X at the start changes only the first chunk; later boundaries still fall at the same positions in the content, so the remaining chunks are reused.
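The difference is easy to demonstrate with a toy experiment. The sketch below is not Dits code: it uses a simplistic window hash in place of FastCDC's gear hash and tiny illustrative chunk sizes, but it shows why boundaries that depend on content survive an insertion while fixed offsets do not.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Fixed-size chunking: cut every `size` bytes.
fn fixed_chunks(data: &[u8], size: usize) -> Vec<Vec<u8>> {
    data.chunks(size).map(|c| c.to_vec()).collect()
}

/// Toy content-defined chunking: cut when a hash of the trailing window
/// matches a bit pattern (a stand-in for FastCDC's gear-based rolling hash).
fn cdc_chunks(data: &[u8], window: usize, mask: u64) -> Vec<Vec<u8>> {
    let mut chunks = Vec::new();
    let mut start = 0;
    for i in window..data.len() {
        let mut h = DefaultHasher::new();
        data[i - window..i].hash(&mut h);
        if (h.finish() & mask) == 0 {
            chunks.push(data[start..i].to_vec());
            start = i;
        }
    }
    chunks.push(data[start..].to_vec());
    chunks
}

/// How many chunks of `b` already exist in `a`?
fn reused(a: &[Vec<u8>], b: &[Vec<u8>]) -> usize {
    let seen: HashSet<&[u8]> = a.iter().map(|c| c.as_slice()).collect();
    b.iter().filter(|c| seen.contains(c.as_slice())).count()
}

fn main() {
    // Deterministic pseudo-random "file", then the same file with 7 bytes
    // inserted at the start.
    let original: Vec<u8> = (0u32..200_000)
        .map(|i| (i.wrapping_mul(2_654_435_761) >> 24) as u8)
        .collect();
    let mut shifted = b"INSERTX".to_vec();
    shifted.extend_from_slice(&original);

    let (f1, f2) = (fixed_chunks(&original, 4096), fixed_chunks(&shifted, 4096));
    let (c1, c2) = (cdc_chunks(&original, 16, 0xFFF), cdc_chunks(&shifted, 16, 0xFFF));

    println!("fixed-size:      {}/{} chunks reused", reused(&f1, &f2), f2.len());
    println!("content-defined: {}/{} chunks reused", reused(&c1, &c2), c2.len());
}
```

Running this shows essentially zero reuse for fixed-size chunks and near-total reuse for content-defined chunks, despite the two files differing only in the inserted prefix.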
Video-Aware Chunking
For video files, Dits goes beyond basic CDC by aligning chunk boundaries to video keyframes (I-frames) when possible.
Keyframe Alignment
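Aligning cuts to keyframes means that edits which replace whole groups of pictures tend to leave the other chunks untouched. The sketch below shows the general idea, not Dits's actual implementation: given a candidate boundary from the CDC pass and the byte offsets of the file's keyframes (recovered from the container index), the cut is snapped to a nearby keyframe. The `tolerance` parameter is an assumption for illustration.

```rust
/// Snap a candidate CDC boundary to the nearest keyframe offset, if one lies
/// within `tolerance` bytes; otherwise keep the CDC boundary unchanged.
/// `keyframe_offsets` is assumed to be sorted in ascending order.
/// (Illustrative sketch only; not Dits's actual implementation.)
fn align_to_keyframe(candidate: u64, keyframe_offsets: &[u64], tolerance: u64) -> u64 {
    // Index of the first keyframe at or after the candidate position.
    let idx = keyframe_offsets.partition_point(|&k| k < candidate);

    // Consider only the keyframes immediately before and after the candidate.
    let neighbours =
        &keyframe_offsets[idx.saturating_sub(1)..(idx + 1).min(keyframe_offsets.len())];

    neighbours
        .iter()
        .copied()
        .filter(|k| k.abs_diff(candidate) <= tolerance)
        .min_by_key(|k| k.abs_diff(candidate))
        .unwrap_or(candidate)
}
```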
Deduplication
After chunking, each chunk is hashed using a cryptographic hash function (BLAKE3 by default). Chunks with identical hashes are stored only once, regardless of which files they came from.
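In outline, deduplication is just a content-addressed map from hash to chunk bytes. A minimal in-memory sketch using the blake3 crate (the HashMap stands in for Dits's real chunk store, whose interface isn't shown here):

```rust
use std::collections::HashMap;

/// A toy content-addressed chunk store: identical chunks are stored once,
/// no matter how many files or versions reference them.
struct ChunkStore {
    chunks: HashMap<[u8; 32], Vec<u8>>, // BLAKE3 hash -> chunk bytes
}

impl ChunkStore {
    fn new() -> Self {
        Self { chunks: HashMap::new() }
    }

    /// Insert a chunk and return its hash. If the chunk is already present,
    /// nothing new is stored (this is the deduplication step).
    fn put(&mut self, chunk: &[u8]) -> [u8; 32] {
        let hash = *blake3::hash(chunk).as_bytes();
        self.chunks.entry(hash).or_insert_with(|| chunk.to_vec());
        hash
    }

    fn get(&self, hash: &[u8; 32]) -> Option<&[u8]> {
        self.chunks.get(hash).map(|c| c.as_slice())
    }
}
```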
Hash Algorithm Options
Dits supports multiple hash algorithms for different performance and security requirements:
| Algorithm | Profile | Throughput | Notes |
|---|---|---|---|
| BLAKE3 (default) | Optimized for speed | 3+ GB/s per core; 10+ GB/s multi-threaded | Cryptographically secure; designed for multi-core hashing |
| SHA-256 | Industry standard | ~500 MB/s | Maximum compatibility; widely analyzed |
| SHA-3-256 | Future-proof | ~300 MB/s | Different algorithm family from SHA-2; post-quantum resistant |
Where Deduplication Helps
- Multiple takes: Similar shots from the same scene share most of their chunks
- Version history: Editing a video only creates new chunks for the changed portions
- Cross-project: Stock footage used in multiple projects is stored once
- Duplicated files: Copies of the same file share 100% of their chunks
Real-World Example
```
Project: 3 takes of a 2-minute scene (1080p ProRes)

Without deduplication:
  Take 1: 12 GB
  Take 2: 12 GB
  Take 3: 12 GB
  Total:  36 GB

With Dits chunking:
  Take 1: 12 GB (all new chunks)
  Take 2:  2 GB (83% shared with Take 1)
  Take 3:  1 GB (92% shared with previous takes)
  Total:  15 GB (58% savings)
```
Choosing a Chunking Algorithm
Different algorithms work better for different scenarios:
| Use Case | Recommended Algorithm | Why |
|---|---|---|
| General purpose | FastCDC | Best balance of performance and deduplication |
| Large files (>1GB) | Parallel FastCDC | Linear scaling with CPU cores |
| Mission-critical data | Chonkers | Provable guarantees on size and locality |
| Privacy-sensitive data | Keyed FastCDC | Prevents fingerprinting attacks |
| Consistent chunk sizes | Asymmetric Extremum | Minimal size variance |
| Legacy compatibility | Rabin | Classic algorithm with strong locality |
Chunk Parameters
Dits uses carefully tuned parameters for different file types:
| File Type | Min Size | Avg Size | Max Size | Notes |
|---|---|---|---|---|
| Video (H.264/H.265) | 256 KB | 1 MB | 4 MB | Keyframe-aligned |
| Video (ProRes/DNxHR) | 512 KB | 2 MB | 8 MB | Larger for efficiency |
| Audio | 64 KB | 256 KB | 1 MB | Smaller for precision |
| Images | 64 KB | 512 KB | 2 MB | Standard CDC |
| Other | 256 KB | 1 MB | 4 MB | Default parameters |
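In code form, the table above amounts to a small lookup. The `FileType` enum and `ChunkParams` struct below are illustrative names, not Dits's actual API; the sizes are the ones listed in the table.

```rust
const KB: usize = 1024;
const MB: usize = 1024 * KB;

enum FileType {
    VideoH26x,  // H.264 / H.265
    VideoIntra, // ProRes / DNxHR
    Audio,
    Image,
    Other,
}

struct ChunkParams {
    min: usize,
    avg: usize,
    max: usize,
}

/// Per-file-type chunking parameters (values taken from the table above).
fn chunk_params(kind: FileType) -> ChunkParams {
    match kind {
        FileType::VideoH26x  => ChunkParams { min: 256 * KB, avg: 1 * MB,   max: 4 * MB },
        FileType::VideoIntra => ChunkParams { min: 512 * KB, avg: 2 * MB,   max: 8 * MB },
        FileType::Audio      => ChunkParams { min: 64 * KB,  avg: 256 * KB, max: 1 * MB },
        FileType::Image      => ChunkParams { min: 64 * KB,  avg: 512 * KB, max: 2 * MB },
        FileType::Other      => ChunkParams { min: 256 * KB, avg: 1 * MB,   max: 4 * MB },
    }
}
```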
Technical Details
The FastCDC Algorithm
FastCDC improves on classic Rabin-based CDC by using a gear-based rolling hash that is significantly faster while maintaining good deduplication properties.
```rust
// Simplified FastCDC boundary search. GEAR_TABLE is a pre-computed table of
// 256 random u64 values; MIN_SIZE and MAX_SIZE are the chunk size limits
// (all defined elsewhere).
fn find_chunk_boundary(data: &[u8]) -> usize {
    let mut hash: u64 = 0;
    let mask: u64 = (1 << 20) - 1; // 20-bit mask -> ~1 MB average chunks

    for (i, &byte) in data.iter().enumerate() {
        hash = (hash << 1).wrapping_add(GEAR_TABLE[byte as usize]);
        if i >= MIN_SIZE && (hash & mask) == 0 {
            return i; // Found a content-defined boundary
        }
        if i >= MAX_SIZE {
            return i; // Force a boundary at the maximum chunk size
        }
    }
    data.len() // End of data: the remainder forms the final chunk
}
```
BLAKE3 Hashing
Each chunk is identified by its BLAKE3 hash, a 256-bit cryptographic hash that provides:
- Speed: 3-10x faster than SHA-256
- Security: Cryptographically secure collision resistance
- Parallelism: Designed for multi-core processing
- Streaming: Can hash data incrementally
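The streaming property is what lets chunks be hashed as they are read, without buffering them whole. A short sketch using the blake3 crate's incremental Hasher (the buffer size is arbitrary):

```rust
use std::io::Read;

/// Hash a stream incrementally with BLAKE3, 64 KB at a time.
fn hash_stream(mut reader: impl Read) -> std::io::Result<blake3::Hash> {
    let mut hasher = blake3::Hasher::new();
    let mut buf = [0u8; 64 * 1024];
    loop {
        let n = reader.read(&mut buf)?;
        if n == 0 {
            break; // end of stream
        }
        hasher.update(&buf[..n]); // feed the hasher one buffer at a time
    }
    Ok(hasher.finalize())
}
```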
Next Steps
- Learn about Content Addressing
- Understand Repositories
- Explore Video Features