
Chunking & Deduplication

Dits uses content-defined chunking (CDC) to break files into variable-size pieces, enabling efficient storage and transfer of large binary files.

Why Chunking Matters

Traditional version control systems like Git treat each file as a single unit. When you modify a 10GB video file, Git stores an entirely new copy, even if you only changed a few seconds of footage.

Dits takes a different approach: it breaks files into smaller chunks (typically 256KB to 4MB) and only stores unique chunks. This means:

  • 65% Less Storage: Similar files share chunks, dramatically reducing total storage needs for projects with multiple takes or versions.
  • Faster Transfers: Only missing chunks need to be transferred. Pushing a small edit to a large file takes seconds, not minutes.
  • Efficient Diffs: Changes are localized to affected chunks, making it easy to see exactly what changed between versions.

Content-Defined Chunking (CDC)

Dits implements multiple content-defined chunking algorithms optimized for different use cases. The default is FastCDC, but alternatives are available for specific performance or security requirements.

Available Chunking Algorithms

FastCDC (Default, Recommended)
High-performance content-defined chunking. Excellent deduplication with good locality properties. Best for general use.
  • Throughput: 2 GB/s
  • Deduplication: Excellent
  • Memory: Low

Rabin Fingerprinting
Classic polynomial rolling hash. Strong locality guarantees but may have higher chunk size variance.
  • Throughput: 1.5 GB/s
  • Deduplication: Good
  • Locality: Excellent

Asymmetric Extremum
Places boundaries at local minima/maxima. Better chunk size control with reduced variance.
  • Throughput: 1.8 GB/s
  • Size Variance: Low
  • Metadata: Minimal

Chonkers Algorithm (Advanced)
Layered algorithm with provable strict guarantees on chunk size and edit locality. Mission-critical reliability.
  • Guarantees: Strict
  • Throughput: 1.2 GB/s
  • Complexity: High

Parallel FastCDC (Performance)
Multi-core FastCDC implementation. 2-4x throughput improvement on modern multi-core systems.
  • Throughput: 4-8 GB/s
  • Scalability: Linear
  • Large Files: Optimal

Keyed FastCDC (Security)
Privacy-enhanced FastCDC that prevents fingerprinting attacks by incorporating a secret key.
  • Privacy: Protected
  • Performance: Same as FastCDC
  • Security: Anti-fingerprinting

FastCDC Algorithm Details

FastCDC is Dits' primary chunking algorithm, providing an optimal balance of performance and deduplication effectiveness.

How It Works

  1. Rolling Hash: A sliding window moves through the file, computing a rolling hash at each position.
  2. Boundary Detection: When the hash matches a specific pattern (determined by a bitmask), a chunk boundary is created.
  3. Size Constraints: Chunks are constrained to be between minimum and maximum sizes to ensure consistent behavior.
// Dits chunking parameters
Minimum chunk size:  256 KB
Average chunk size:    1 MB
Maximum chunk size:    4 MB

// Example: 10GB video file
Original file:     10 GB (1 file)
After chunking:    ~10,000 chunks
Average chunk:     ~1 MB each

Why Content-Defined?

The key advantage of CDC over fixed-size chunking is shift resistance. Consider what happens when you insert data at the beginning of a file:

Fixed-Size Chunking

Insert X at start → all chunks shift

Chunks: AAAA | BBBB | CCCC | DDDD
Result: all chunks changed, 0% reuse

Content-Defined Chunking

Insert X at start → only 1 new chunk

Chunks: X | AAA | BBBBB | CC | DDDDD
Result: only 1 new chunk, 80%+ reuse
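
The shift-resistance argument above can be tried out with a small experiment. The sketch below is not Dits code: it uses a toy rolling hash, tiny chunk sizes, and pseudo-random bytes in place of a real file, and all function names are hypothetical. It chunks a buffer both ways, prepends one byte, and measures how many of the new chunks already existed.

```rust
use std::collections::HashSet;

/// Deterministic pseudo-random bytes (xorshift64), standing in for file content.
fn sample_data(len: usize) -> Vec<u8> {
    let mut state: u64 = 0x2545_F491_4F6C_DD1D;
    (0..len)
        .map(|_| {
            state ^= state << 13;
            state ^= state >> 7;
            state ^= state << 17;
            (state >> 32) as u8
        })
        .collect()
}

/// Fixed-size chunking: cut every `size` bytes, regardless of content.
fn fixed_chunks(data: &[u8], size: usize) -> Vec<&[u8]> {
    data.chunks(size).collect()
}

/// Toy content-defined chunking: cut where the low bits of a rolling hash
/// are all zero, subject to minimum and maximum chunk lengths.
fn cdc_chunks(data: &[u8], min: usize, max: usize, mask: u64) -> Vec<&[u8]> {
    let mut chunks = Vec::new();
    let mut start = 0;
    let mut hash: u64 = 0;
    for (i, &byte) in data.iter().enumerate() {
        // The shift ages out old bytes; the multiply spreads each byte over the word.
        hash = (hash << 1).wrapping_add((byte as u64 + 1).wrapping_mul(0x9E37_79B9_7F4A_7C15));
        let len = i + 1 - start;
        if (len >= min && (hash & mask) == 0) || len >= max {
            chunks.push(&data[start..=i]);
            start = i + 1;
            hash = 0;
        }
    }
    if start < data.len() {
        chunks.push(&data[start..]); // trailing partial chunk
    }
    chunks
}

/// Fraction of chunks in `new` whose exact bytes already appear in `old`.
fn reuse_ratio<'a>(old: &[&'a [u8]], new: &[&'a [u8]]) -> f64 {
    let known: HashSet<&[u8]> = old.iter().copied().collect();
    let reused = new.iter().filter(|c| known.contains(*c)).count();
    reused as f64 / new.len() as f64
}
```

Comparing `reuse_ratio` for `fixed_chunks` and `cdc_chunks` before and after the insert should show the fixed-size variant reusing essentially nothing, while the content-defined variant resynchronizes at the first boundary both versions share.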

Video-Aware Chunking

For video files, Dits goes beyond basic CDC by aligning chunk boundaries to video keyframes (I-frames) when possible.

Frames: I P P P | I P P P | I ...
Chunks: Chunk 1 | Chunk 2 | ...

Legend: I = keyframe (complete image), P = predicted frame, B = bidirectional frame
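
One way keyframe alignment could work is to let CDC propose a boundary and then snap it to a nearby keyframe, as long as the chunk stays within its size limits. The sketch below only illustrates that idea and is not Dits' actual logic: the function name, its parameters, and the assumption that keyframe byte offsets are already known (for example from a container parser) are all hypothetical.

```rust
/// Snap a CDC-proposed boundary to the closest keyframe offset that keeps the
/// chunk within [min_size, max_size]; fall back to the CDC boundary otherwise.
///
/// `keyframes` holds absolute byte offsets of keyframes (hypothetical input).
fn align_to_keyframe(
    chunk_start: usize,
    cdc_boundary: usize,
    keyframes: &[usize],
    min_size: usize,
    max_size: usize,
) -> usize {
    keyframes
        .iter()
        .copied()
        // Only keyframes that respect the chunk size limits qualify.
        .filter(|&k| k >= chunk_start + min_size && k <= chunk_start + max_size)
        // Prefer the keyframe closest to where CDC wanted to cut.
        .min_by_key(|&k| k.abs_diff(cdc_boundary))
        .unwrap_or(cdc_boundary)
}
```

When no keyframe falls inside the allowed range, the function returns the plain CDC boundary, which matches the "when possible" wording above.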

Deduplication

After chunking, each chunk is hashed using a cryptographic hash function (BLAKE3 by default). Chunks with identical hashes are stored only once, regardless of which files they came from.
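
A minimal sketch of hash-keyed chunk storage follows. To stay dependency-free it uses the standard library's `DefaultHasher` as a stand-in for BLAKE3; that hasher is 64-bit and not cryptographic, so it is only suitable for illustration. `ChunkStore` and its methods are hypothetical names, not part of Dits.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Stand-in chunk ID. NOT cryptographic: a real store would use a 256-bit
/// digest such as BLAKE3 here.
fn chunk_id(chunk: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    chunk.hash(&mut hasher);
    hasher.finish()
}

#[derive(Default)]
struct ChunkStore {
    chunks: HashMap<u64, Vec<u8>>,
}

impl ChunkStore {
    /// Stores the chunk only if its ID is new; returns (id, was_newly_stored).
    fn put(&mut self, chunk: &[u8]) -> (u64, bool) {
        let id = chunk_id(chunk);
        let is_new = !self.chunks.contains_key(&id);
        if is_new {
            self.chunks.insert(id, chunk.to_vec());
        }
        (id, is_new)
    }

    /// Adds a file given as a list of chunks; returns its manifest of chunk IDs.
    fn add_file(&mut self, chunks: &[&[u8]]) -> Vec<u64> {
        chunks.iter().map(|&c| self.put(c).0).collect()
    }

    /// Total bytes physically stored after deduplication.
    fn stored_bytes(&self) -> usize {
        self.chunks.values().map(Vec::len).sum()
    }
}
```

Two files that share chunks end up with manifests pointing at the same IDs, and `stored_bytes` grows only by the size of the chunks that were actually new.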

Hash Algorithm Options

Dits supports multiple hash algorithms for different performance and security requirements:

BLAKE3

Default - Optimized for speed

  • 3+ GB/s per core (multi-threaded: 10+ GB/s)
  • Multi-threaded
  • Cryptographically secure

SHA-256

Industry standard

  • ~500 MB/s throughput
  • Maximum compatibility
  • Widely analyzed

SHA-3-256

Future-proof

  • ~300 MB/s throughput
  • Different algorithm family
  • No known quantum attacks beyond generic search speedups, which apply to all 256-bit hashes

Where Deduplication Helps

  • Multiple takes: Similar shots from the same scene share most of their chunks
  • Version history: Editing a video only creates new chunks for the changed portions
  • Cross-project: Stock footage used in multiple projects is stored once
  • Duplicated files: Copies of the same file share 100% of their chunks

Real-World Example

Project: 3 takes of a 2-minute scene (1080p ProRes)

Without deduplication:
  Take 1:  12 GB
  Take 2:  12 GB
  Take 3:  12 GB
  Total:   36 GB

With Dits chunking:
  Take 1:  12 GB (all new chunks)
  Take 2:   2 GB (83% shared with Take 1)
  Take 3:   1 GB (92% shared with previous)
  Total:   15 GB (58% savings)
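
The percentages in this example follow directly from the sizes. As a quick check, the helper functions below (hypothetical names, written only for this illustration) recompute the per-take sharing and the overall savings from the figures above.

```rust
/// Percentage of a take that was already stored, given its full size and the
/// amount of new data it added (both in GB).
fn shared_percent(full_gb: f64, new_gb: f64) -> f64 {
    100.0 * (1.0 - new_gb / full_gb)
}

/// Returns (raw total, stored total, percent saved) for a set of takes.
fn dedup_savings(full_sizes_gb: &[f64], new_data_gb: &[f64]) -> (f64, f64, f64) {
    let raw: f64 = full_sizes_gb.iter().sum();
    let stored: f64 = new_data_gb.iter().sum();
    (raw, stored, 100.0 * (1.0 - stored / raw))
}
```

With full sizes of 12, 12 and 12 GB and new data of 12, 2 and 1 GB, this gives 36 GB raw, 15 GB stored, and roughly 58.3% saved. Take 2 comes out at about 83.3% shared and Take 3 at about 91.7%, which the example rounds to 83% and 92%.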

Choosing a Chunking Algorithm

Different algorithms work better for different scenarios:

Use Case               | Recommended Algorithm | Why
General purpose        | FastCDC               | Best balance of performance and deduplication
Large files (>1GB)     | Parallel FastCDC      | Linear scaling with CPU cores
Mission-critical data  | Chonkers              | Provable guarantees on size and locality
Privacy-sensitive data | Keyed FastCDC         | Prevents fingerprinting attacks
Consistent chunk sizes | Asymmetric Extremum   | Minimal size variance
Legacy compatibility   | Rabin                 | Classic algorithm with strong locality

Chunk Parameters

Dits uses carefully tuned parameters for different file types:

File Type            | Min Size | Avg Size | Max Size | Notes
Video (H.264/H.265)  | 256 KB   | 1 MB     | 4 MB     | Keyframe-aligned
Video (ProRes/DNxHR) | 512 KB   | 2 MB     | 8 MB     | Larger for efficiency
Audio                | 64 KB    | 256 KB   | 1 MB     | Smaller for precision
Images               | 64 KB    | 512 KB   | 2 MB     | Standard CDC
Other                | 256 KB   | 1 MB     | 4 MB     | Default parameters
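
In code, a parameter table like this one might be expressed as a simple lookup. The sketch below mirrors the values listed above; the type and function names (`FileKind`, `ChunkParams`, `params_for`) are hypothetical and do not describe Dits' real configuration API.

```rust
const KB: usize = 1024;
const MB: usize = 1024 * KB;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum FileKind {
    VideoH26x,        // H.264 / H.265
    VideoProResDnxhr, // ProRes / DNxHR
    Audio,
    Image,
    Other,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct ChunkParams {
    min_size: usize,
    avg_size: usize,
    max_size: usize,
    keyframe_aligned: bool, // set only where the table's Notes column says so
}

fn params_for(kind: FileKind) -> ChunkParams {
    let (min_size, avg_size, max_size, keyframe_aligned) = match kind {
        FileKind::VideoH26x => (256 * KB, MB, 4 * MB, true),
        FileKind::VideoProResDnxhr => (512 * KB, 2 * MB, 8 * MB, false),
        FileKind::Audio => (64 * KB, 256 * KB, MB, false),
        FileKind::Image => (64 * KB, 512 * KB, 2 * MB, false),
        FileKind::Other => (256 * KB, MB, 4 * MB, false),
    };
    ChunkParams { min_size, avg_size, max_size, keyframe_aligned }
}
```

Keeping the values in one `match` makes it easy to check that every file kind satisfies min < avg < max.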

Technical Details

The FastCDC Algorithm

FastCDC improves on earlier Rabin-based CDC by using a gear-based rolling hash: each step is one shift, one addition, and one table lookup, which is significantly faster while maintaining good deduplication properties.

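// Simplified FastCDC-style boundary search (illustrative sketch)
const MIN_SIZE: usize = 256 * 1024;      // 256 KB
const MAX_SIZE: usize = 4 * 1024 * 1024; // 4 MB
const MASK: u64 = (1 << 20) - 1;         // 20 low bits set, for ~1MB average

// One pseudo-random 64-bit value per byte value (filled by a 64-bit LCG here).
const GEAR_TABLE: [u64; 256] = {
    let mut table = [0u64; 256];
    let mut state: u64 = 1;
    let mut i = 0;
    while i < 256 {
        state = state.wrapping_mul(6364136223846793005).wrapping_add(1442695040888963407);
        table[i] = state ^ (state >> 32);
        i += 1;
    }
    table
};

// Returns the length of the first chunk in `data` (the cut falls before byte i).
fn find_chunk_boundary(data: &[u8]) -> usize {
    let mut hash: u64 = 0;
    for (i, &byte) in data.iter().enumerate() {
        // wrapping_add: a plain `+` would panic on overflow in debug builds
        hash = (hash << 1).wrapping_add(GEAR_TABLE[byte as usize]);
        if i >= MIN_SIZE && (hash & MASK) == 0 {
            return i;  // Found boundary
        }
        if i >= MAX_SIZE {
            return i;  // Force boundary at max size
        }
    }
    data.len()  // End of data
}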
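
The 20-bit mask in the snippet above is what sets the target chunk size. If the hash bits are roughly uniform, each position past the minimum size has a 1 in 2^20 chance of being a boundary. The helpers below sketch that relationship. The names are hypothetical, and the expected-size formula assumes a contiguous low-bit mask, uniform hash bits, and no maximum-size cap, so treat it as an approximation rather than a description of Dits' tuning.

```rust
/// Mask with log2(avg_size) low bits set. This sketch only handles
/// power-of-two averages.
fn mask_for_average(avg_size: u64) -> u64 {
    assert!(avg_size.is_power_of_two(), "sketch assumes a power-of-two average");
    avg_size - 1
}

/// Chance that a given position satisfies (hash & mask) == 0, assuming
/// uniformly distributed hash bits.
fn boundary_probability(mask: u64) -> f64 {
    1.0 / 2f64.powi(mask.count_ones() as i32)
}

/// Approximate mean chunk length: the first `min_size` bytes are never cut,
/// then a boundary arrives on average every 2^bits bytes (mask + 1 for a
/// contiguous low-bit mask). The maximum-size cap is ignored here.
fn expected_chunk_size(min_size: u64, mask: u64) -> u64 {
    min_size + mask + 1
}
```

Under these assumptions, a 256 KB minimum with a 20-bit mask gives a mean of about 1.25 MB rather than exactly 1 MB, which is one reason the code comment says "~1MB".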

BLAKE3 Hashing

Each chunk is identified by its BLAKE3 hash, a 256-bit cryptographic hash that provides:

  • Speed: 3-10x faster than SHA-256
  • Security: Cryptographically secure collision resistance
  • Parallelism: Designed for multi-core processing
  • Streaming: Can hash data incrementally

Next Steps