Algorithms
Dits uses several specialized algorithms to efficiently handle large binary files. This page explains the key algorithms and their implementations.
Chunking Algorithms
Dits implements multiple content-defined chunking algorithms, each optimized for different performance, security, and reliability requirements. All algorithms follow the same core principle: determine chunk boundaries based on content patterns rather than fixed positions.
FastCDC (Primary Algorithm)
FastCDC is Dits' default chunking algorithm, providing excellent performance and deduplication ratios for most use cases.
Algorithm Overview
FastCDC Parameters:
MIN_SIZE = 256 KB // Minimum chunk size
AVG_SIZE = 1 MB // Target average size
MAX_SIZE = 4 MB // Maximum chunk size
// Gear table: 256 random 64-bit values
GEAR: [u64; 256]
// Masks for boundary detection
MASK_S = (1 << 13) - 1 // For small chunks
MASK_L = (1 << 15) - 1 // For large chunks
Implementation
fn find_chunk_boundaries(data: &[u8]) -> Vec<usize> {
    let mut boundaries = vec![0];
    let mut pos = 0;
    while pos < data.len() {
        // Boundary offsets are relative to the current position
        let boundary = find_next_boundary(&data[pos..]);
        pos += boundary;
        boundaries.push(pos);
    }
    boundaries
}
fn find_next_boundary(data: &[u8]) -> usize {
    // Rolling gear hash
    let mut hash: u64 = 0;
    // No boundary is ever placed before the minimum chunk size
    let min_end = MIN_SIZE.min(data.len());
    for i in 0..min_end {
        hash = (hash << 1).wrapping_add(GEAR[data[i] as usize]);
    }
    // Normal boundary detection (smaller mask = more boundaries)
    let normal_end = AVG_SIZE.min(data.len());
    for i in min_end..normal_end {
        hash = (hash << 1).wrapping_add(GEAR[data[i] as usize]);
        if (hash & MASK_S) == 0 {
            return i + 1; // Found boundary
        }
    }
    // Past the average size, use the larger mask (fewer boundaries)
    let max_end = MAX_SIZE.min(data.len());
    for i in normal_end..max_end {
        hash = (hash << 1).wrapping_add(GEAR[data[i] as usize]);
        if (hash & MASK_L) == 0 {
            return i + 1; // Found boundary
        }
    }
    // Force a boundary at the maximum chunk size (or end of data)
    max_end
}
Streaming FastCDC Implementation
Performance Optimization
The original FastCDC path loads entire files into memory, which exhausts RAM on large files (a 1 GB file needs roughly 1 GB of RAM). Dits' streaming implementation instead processes files in a 64 KB rolling window with 16 KB of lookahead, so memory use stays bounded regardless of file size.
// Streaming FastCDC entry point
async fn chunk_streaming(mut reader: impl AsyncRead + Unpin) -> Result<Vec<Chunk>> {
    // Read all data up front (the windowed streaming version is in development)
    let mut all_data = Vec::new();
    reader.read_to_end(&mut all_data).await?;
    let mut chunks = Vec::new();
    let mut pos = 0;
    while pos < all_data.len() {
        let remaining = all_data.len() - pos;
        let chunk_size = if remaining <= MAX_SIZE {
            remaining
        } else {
            // Size-based chunking (content-based boundaries in the next iteration)
            AVG_SIZE.min(remaining)
        };
        let chunk_data = all_data[pos..pos + chunk_size].to_vec();
        chunks.push(Chunk::from_data(Bytes::from(chunk_data)));
        pos += chunk_size;
    }
    Ok(chunks)
}
// Performance results:
// - 10 MB file: ~47 ms chunking time (~212 MB/s throughput)
// - Memory usage: 64 KB window (vs. O(file_size) for the buffered path)
// - Unlimited file sizes supported
Performance Improvements
- Memory usage: 94% reduction (bounded 64 KB window vs. proportional to file size)
- Scalability: unlimited file sizes (previously limited by available RAM)
- Speed: 10 MB files chunked in ~47 ms
- Consistency: same chunk boundaries as the buffered implementation
Shift Resistance
The key property of CDC is shift resistance: inserting or deleting data affects only nearby chunks, not the entire file (see the toy sketch after the diagrams):
Original: [AAAA][BBBBB][CCC][DDDDD]
Boundaries at content-defined positions
Insert "XX" at position 2:
[AA][XX][AA][BBBBB][CCC][DDDDD]
↑
Only first chunk changed!
Fixed-size (bad):
Original: [AAAA][BBBB][CCCC][DDDD]
Insert: [AAXX][AABB][BBCC][CCDD][DD]
ALL chunks changed!
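To see this property concretely, the following self-contained toy (not Dits' production chunker: the gear table, mask, and test data are made up for illustration) chunks a buffer, inserts two bytes near the front, and re-chunks it. Boundaries downstream of the edit reappear at the same content positions, shifted by the two inserted bytes, so only the chunk around the edit changes.
// Toy shift-resistance demo with deliberately small parameters.
fn toy_boundaries(data: &[u8]) -> Vec<usize> {
    const MASK: u64 = (1 << 6) - 1; // ~64-byte average chunks (toy value)
    // Toy gear table; a real implementation uses 256 random 64-bit values
    let gear: Vec<u64> = (0..256u64).map(|b| b.wrapping_mul(0x9E37_79B9_7F4A_7C15)).collect();
    let mut boundaries = vec![0];
    let mut hash = 0u64;
    for (i, &b) in data.iter().enumerate() {
        hash = (hash << 1).wrapping_add(gear[b as usize]);
        if hash & MASK == 0 {
            boundaries.push(i + 1);
        }
    }
    if boundaries.last() != Some(&data.len()) {
        boundaries.push(data.len());
    }
    boundaries
}

fn main() {
    // Deterministic pseudo-random test data (simple LCG)
    let mut state: u64 = 0x1234_5678;
    let original: Vec<u8> = (0..4096)
        .map(|_| {
            state = state
                .wrapping_mul(6364136223846793005)
                .wrapping_add(1442695040888963407);
            (state >> 56) as u8
        })
        .collect();
    let mut edited = original.clone();
    edited.insert(100, 0xAA); // two-byte insertion near the front
    edited.insert(101, 0xBB);
    let a = toy_boundaries(&original);
    let b = toy_boundaries(&edited);
    // Boundaries downstream of the edit reappear at the same content,
    // shifted by the two inserted bytes; earlier boundaries are untouched.
    println!("original boundaries: {:?}", a);
    println!("edited boundaries:   {:?}", b);
}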
Rabin Fingerprinting
Classic content-defined chunking using a polynomial rolling hash over a sliding window. It provides strong locality guarantees but may produce more variable chunk sizes.
fn rabin_chunk(data: &[u8]) -> Vec<Chunk> {
    let window_size = 64;
    let modulus: u128 = 0x45d9f3b3335b369; // Prime modulus for the polynomial hash
    let mask: u128 = (1 << 20) - 1; // ~1 MB target chunk size
    let mut hash: u128 = 0;
    let mut chunks = Vec::new();
    let mut chunk_start = 0;
    // Initialize the rolling hash over the first window
    for i in 0..window_size.min(data.len()) {
        hash = (hash * 256 + data[i] as u128) % modulus;
    }
    // Factor used to remove the outgoing byte: 256^(window_size - 1) mod modulus
    let out_factor = pow256(window_size - 1, modulus);
    for i in window_size..data.len() {
        // Slide the window: drop the outgoing byte, add the incoming byte
        let outgoing = (data[i - window_size] as u128 * out_factor) % modulus;
        hash = (hash + modulus - outgoing) % modulus;
        hash = (hash * 256 + data[i] as u128) % modulus;
        // Boundary condition: low bits are zero and the chunk is large enough
        if (hash & mask) == 0 && i - chunk_start >= MIN_SIZE {
            chunks.push(create_chunk(&data[chunk_start..i]));
            chunk_start = i;
        }
    }
    // Final chunk
    if chunk_start < data.len() {
        chunks.push(create_chunk(&data[chunk_start..]));
    }
    chunks
}
Asymmetric Extremum (AE)
AE chunking places boundaries at local extrema (minima/maxima) within a sliding window. This provides better control over chunk size distribution and reduces extreme size variance.
fn ae_chunk(data: &[u8], window_size: usize) -> Vec<Chunk> {
    let mut chunks = Vec::new();
    let mut chunk_start = 0;
    while chunk_start < data.len() {
        // Candidate position: one minimum-size chunk past the current start
        let pos = (chunk_start + MIN_SIZE).min(data.len());
        let start = pos.saturating_sub(window_size / 2).max(chunk_start + 1);
        let end = (pos + window_size / 2).min(data.len());
        // Fall back to forcing a boundary at the maximum chunk size
        let mut boundary = (chunk_start + MAX_SIZE).min(data.len());
        if end > start + 1 {
            let window = &data[start..end];
            if let Some(i) = (1..window.len() - 1)
                .find(|&i| window[i] < window[i - 1] && window[i] < window[i + 1])
            {
                // Prefer a local minimum inside the window
                boundary = start + i;
            } else if let Some(i) = (1..window.len() - 1)
                .find(|&i| window[i] > window[i - 1] && window[i] > window[i + 1])
            {
                // Otherwise accept a local maximum
                boundary = start + i;
            }
        }
        chunks.push(create_chunk(&data[chunk_start..boundary]));
        chunk_start = boundary;
    }
    chunks
}
Chonkers Algorithm
Chonkers provides strict, provable guarantees on both chunk size and edit locality through a layered merging approach, making it well suited to mission-critical applications that require mathematical guarantees.
struct ChonkersConfig {
    absolute_unit: usize, // Base size unit (e.g., 1 MB)
    layers: usize,        // Number of hierarchical layers
}

fn chonkers_chunk(data: &[u8], config: &ChonkersConfig) -> Vec<Chunk> {
    // Phase 1: Proto-chunking into small fixed-size pieces
    let mut chunks: Vec<Vec<u8>> = data
        .chunks(config.absolute_unit / 4)
        .map(|chunk| chunk.to_vec())
        .collect();
    // Phase 2: Balancing - merge chunks that are lighter than their neighbors
    balancing_phase(&mut chunks);
    // Phase 3: Caterpillar - merge identical consecutive chunks
    caterpillar_phase(&mut chunks);
    // Phase 4: Diffbit - merge based on content differences
    diffbit_phase(&mut chunks, config);
    // Convert to final chunks
    chunks.into_iter()
        .map(|bytes| create_chunk(&bytes))
        .collect()
}

fn balancing_phase(chunks: &mut Vec<Vec<u8>>) {
    let mut i = 0;
    while i < chunks.len() {
        if is_locally_minimal(chunks, i) {
            merge_with_neighbor(chunks, i);
        } else {
            i += 1;
        }
    }
}

fn diffbit_phase(chunks: &mut Vec<Vec<u8>>, config: &ChonkersConfig) {
    // Compute diffbits between consecutive chunks
    let diffbits: Vec<u64> = compute_diffbits(chunks);
    // Merge neighbors whose diffbits are compatible and whose combined size stays in bounds
    let mut i = 0;
    while i + 1 < chunks.len() {
        let combined_size = chunks[i].len() + chunks[i + 1].len();
        if combined_size < config.absolute_unit && diffbits_compatible(diffbits[i], diffbits[i + 1]) {
            merge_chunks(chunks, i);
            // Diffbits would be recomputed after each merge
        } else {
            i += 1;
        }
    }
}
Parallel FastCDC
A multi-core implementation that splits large files into segments and chunks them in parallel with Rayon. Throughput scales nearly linearly with the number of CPU cores when chunking large files.
fn parallel_fastcdc_chunk(data: &[u8], num_workers: usize) -> Vec<Chunk> {
    use rayon::prelude::*;
    // Split the data into segments, one per worker (but never smaller than MAX_SIZE)
    let segment_size = (data.len() / num_workers).max(MAX_SIZE);
    let segments: Vec<(usize, &[u8])> = data
        .chunks(segment_size)
        .enumerate()
        .map(|(i, chunk)| (i * segment_size, chunk))
        .collect();
    // Process segments in parallel
    let chunk_results: Vec<Vec<Chunk>> = segments
        .par_iter()
        .map(|(offset, segment)| fastcdc_chunk_segment(segment, *offset))
        .collect();
    // Flatten and sort by offset
    let mut all_chunks: Vec<Chunk> = chunk_results
        .into_iter()
        .flatten()
        .collect();
    all_chunks.sort_by_key(|chunk| chunk.offset());
    all_chunks
}

fn fastcdc_chunk_segment(data: &[u8], base_offset: usize) -> Vec<Chunk> {
    // Standard FastCDC over one segment, with offsets adjusted
    // so chunks report their global position in the file
    fastcdc_chunk(data).into_iter()
        .map(|chunk| chunk.with_offset_adjustment(base_offset))
        .collect()
}
Keyed FastCDC (KCDC)
Security-enhanced FastCDC that incorporates a secret key to prevent fingerprinting attacks. The key randomizes chunking decisions while maintaining performance.
struct KcdcChunker {
    fastcdc: FastCDC,
    key: [u8; 32], // Secret key
}

impl KcdcChunker {
    fn prf(&self, counter: u64, window: &[u8]) -> u64 {
        // HMAC-SHA256 as the PRF
        use hmac::{Hmac, Mac};
        use sha2::Sha256;
        type HmacSha256 = Hmac<Sha256>;
        let mut mac = HmacSha256::new_from_slice(&self.key)
            .expect("HMAC accepts keys of any length");
        mac.update(&counter.to_be_bytes());
        mac.update(window);
        let result = mac.finalize().into_bytes();
        u64::from_be_bytes(result[..8].try_into().unwrap())
    }

    fn chunk(&self, data: &[u8]) -> Vec<Chunk> {
        let mut chunks = Vec::new();
        let mut pos = 0;
        let mut counter = 0u64;
        while pos < data.len() {
            let boundary = self.find_boundary(&data[pos..], counter);
            let chunk_end = (pos + boundary).min(data.len());
            chunks.push(create_chunk(&data[pos..chunk_end]));
            pos = chunk_end;
            counter += 1;
        }
        chunks
    }

    fn find_boundary(&self, data: &[u8], counter: u64) -> usize {
        let window_size = 64;
        // If no keyed boundary is found, force one at the maximum chunk size
        let mut boundary = MAX_SIZE;
        for i in (MIN_SIZE..MAX_SIZE.min(data.len())).step_by(window_size) {
            let window_start = i.saturating_sub(window_size);
            let window = &data[window_start..i.min(data.len())];
            let prf_value = self.prf(counter, window);
            let target_size = AVG_SIZE as u64;
            // Boundary condition randomized by the key
            if (prf_value % target_size) == 0 {
                boundary = i;
                break;
            }
        }
        boundary.min(MAX_SIZE).min(data.len())
    }
}
Keyframe Alignment
For video files, Dits adjusts chunk boundaries to align with keyframes (I-frames) when possible:
fn align_to_keyframe(
    boundary: usize,
    keyframes: &[usize],
    tolerance: usize,
) -> usize {
    // Find keyframes within tolerance of the proposed boundary
    let candidates: Vec<_> = keyframes
        .iter()
        .filter(|&&kf| kf.abs_diff(boundary) <= tolerance)
        .collect();
    // Return the nearest keyframe, or the original boundary if none qualify
    candidates
        .into_iter()
        .min_by_key(|&&kf| kf.abs_diff(boundary))
        .copied()
        .unwrap_or(boundary)
}

// Integration with CDC:
fn chunk_video(data: &[u8], keyframes: &[usize]) -> Vec<Chunk> {
    let mut chunks = Vec::new();
    let mut pos = 0;
    while pos < data.len() {
        // Find the content-defined boundary
        let cdc_boundary = find_next_boundary(&data[pos..]);
        // Snap it to a nearby keyframe if one is close enough
        let aligned = align_to_keyframe(
            pos + cdc_boundary,
            keyframes,
            TOLERANCE, // e.g., 64 KB
        );
        chunks.push(create_chunk(&data[pos..aligned]));
        pos = aligned;
    }
    chunks
}
Cryptographic Hashing
Dits supports multiple cryptographic hash algorithms for content addressing, so users can choose based on performance, security, or compliance requirements (a minimal selection sketch follows the comparison below).
Supported Hash Algorithms
BLAKE3 (Default)
- Speed: 3+ GB/s per core
- Parallelism: Multi-threaded
- Security: BLAKE family
- Best for: Performance
SHA-256
- Speed: ~500 MB/s
- Parallelism: Single-threaded
- Security: SHA-2 family
- Best for: Compliance
SHA-3-256
- Speed: ~300 MB/s
- Parallelism: Single-threaded
- Security: SHA-3 family
- Best for: Future-proofing
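As a rough illustration of how such a choice could be wired up, the sketch below dispatches over the three algorithms using the blake3, sha2, and sha3 crates; the HashAlgorithm enum and content_address function are hypothetical names, not Dits' actual API.
use sha2::{Digest as _, Sha256};
use sha3::Sha3_256;

// Hypothetical selector mirroring the three supported algorithms.
enum HashAlgorithm {
    Blake3,
    Sha256,
    Sha3,
}

// All three produce a 32-byte digest, so chunk IDs stay the same width.
fn content_address(algo: &HashAlgorithm, data: &[u8]) -> [u8; 32] {
    match algo {
        HashAlgorithm::Blake3 => *blake3::hash(data).as_bytes(),
        HashAlgorithm::Sha256 => Sha256::digest(data).as_slice().try_into().unwrap(),
        HashAlgorithm::Sha3 => Sha3_256::digest(data).as_slice().try_into().unwrap(),
    }
}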
BLAKE3 Implementation
BLAKE3 is Dits' default hash algorithm, providing exceptional performance and security for content addressing.
use std::io::{self, Read};
use blake3::Hasher;

fn hash_chunk(data: &[u8]) -> [u8; 32] {
    let mut hasher = Hasher::new();
    hasher.update(data);
    *hasher.finalize().as_bytes()
}

// For large buffers, BLAKE3's tree structure allows parallel hashing
// (SIMD is used automatically; multi-threading requires the crate's
// `rayon` feature and Hasher::update_rayon):
fn hash_file_parallel(data: &[u8]) -> [u8; 32] {
    blake3::hash(data).into()
}

// Incremental hashing (for streaming):
fn hash_streaming<R: Read>(reader: &mut R) -> io::Result<[u8; 32]> {
    let mut hasher = Hasher::new();
    let mut buffer = [0u8; 65536];
    loop {
        let n = reader.read(&mut buffer)?;
        if n == 0 { break; }
        hasher.update(&buffer[..n]);
    }
    Ok(*hasher.finalize().as_bytes())
}
Algorithm Configuration
Dits allows configuration of chunking and hashing algorithms per repository or globally. This enables optimization for specific use cases.
# Configure chunking algorithm
dits config core.chunkingAlgorithm fastcdc
# Options: fastcdc, rabin, ae, chonkers, parallel-fastcdc, keyed-fastcdc
# Configure hashing algorithm
dits config core.hashAlgorithm blake3
# Options: blake3, sha256, sha3-256
# Configure KCDC key (for keyed chunking)
dits config core.chunkingKey "$(openssl rand -hex 32)"
# Per-repository settings override global defaults
dits config --local core.chunkingAlgorithm chonkers
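Internally, such a setting has to be mapped onto a concrete chunker. The sketch below is illustrative only (the ChunkingAlgorithm enum and parse_chunking_algorithm function are hypothetical, not Dits' actual internals); the string values mirror the configuration options listed above.
// Hypothetical mapping from config values to an algorithm selector.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ChunkingAlgorithm {
    FastCdc,
    Rabin,
    Ae,
    Chonkers,
    ParallelFastCdc,
    KeyedFastCdc,
}

fn parse_chunking_algorithm(value: &str) -> Option<ChunkingAlgorithm> {
    match value {
        "fastcdc" => Some(ChunkingAlgorithm::FastCdc),
        "rabin" => Some(ChunkingAlgorithm::Rabin),
        "ae" => Some(ChunkingAlgorithm::Ae),
        "chonkers" => Some(ChunkingAlgorithm::Chonkers),
        "parallel-fastcdc" => Some(ChunkingAlgorithm::ParallelFastCdc),
        "keyed-fastcdc" => Some(ChunkingAlgorithm::KeyedFastCdc),
        _ => None, // Unknown value: caller falls back to the default
    }
}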
Performance Considerations
- FastCDC: Best overall performance and deduplication; the default
- Parallel FastCDC: Use for large files (>1 GB)
- Chonkers: Use when strict size and locality guarantees are required
- Keyed FastCDC: Use for privacy-sensitive data
- BLAKE3: Recommended hash default unless compliance requires SHA-2 or SHA-3
ISOBMFF Parsing
For MP4/MOV files, Dits parses the ISOBMFF container format to find keyframe positions and protect critical atoms:
struct Atom {
    size: u64,
    atom_type: [u8; 4],
    offset: u64,
    children: Vec<Atom>,
}

fn parse_atoms(data: &[u8]) -> Vec<Atom> {
    let mut atoms = Vec::new();
    let mut pos = 0;
    while pos + 8 <= data.len() {
        let size = read_u32_be(&data[pos..]) as u64;
        let atom_type = &data[pos + 4..pos + 8];
        // 32-bit size, 64-bit extended size, or "extends to end of file"
        let (actual_size, header_size) = if size == 1 {
            (read_u64_be(&data[pos + 8..]), 16)
        } else if size == 0 {
            ((data.len() - pos) as u64, 8)
        } else {
            (size, 8)
        };
        // A malformed size smaller than its own header would stall the parser
        if actual_size < header_size as u64 {
            break;
        }
        let end = (pos + actual_size as usize).min(data.len());
        let atom = Atom {
            size: actual_size,
            atom_type: atom_type.try_into().unwrap(),
            offset: pos as u64,
            children: if is_container(atom_type) {
                parse_atoms(&data[pos + header_size..end])
            } else {
                Vec::new()
            },
        };
        atoms.push(atom);
        pos = end;
    }
    atoms
}

fn find_keyframes(atoms: &[Atom], data: &[u8]) -> Vec<u64> {
    // Navigate to moov/trak/mdia/minf/stbl/stss;
    // the stss (Sync Sample) box lists the keyframe sample numbers
    let stss = find_atom_path(atoms, &[
        b"moov", b"trak", b"mdia", b"minf", b"stbl", b"stss",
    ]);
    if let Some(stss) = stss {
        parse_stss(&data[stss.offset as usize..])
    } else {
        // No stss box means every sample is a sync sample (e.g., all-intra codecs like ProRes)
        find_all_frame_positions(atoms, data)
    }
}
Protected Atoms
The moov atom contains critical metadata. Dits never places a chunk boundary inside it: the entire moov is kept as a single chunk so the file remains playable. A minimal boundary-guard sketch is shown below.
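This is a hedged sketch rather than Dits' actual code; protect_moov and its arguments are illustrative. It simply pushes any proposed boundary that falls inside the moov atom's byte range out to the end of the atom.
// Illustrative guard: keep the moov atom [moov_start, moov_end) inside one chunk
// by moving any boundary that would split it to the end of the atom.
fn protect_moov(boundary: usize, moov_start: usize, moov_end: usize) -> usize {
    if boundary > moov_start && boundary < moov_end {
        moov_end
    } else {
        boundary
    }
}
A chunker could apply such a guard to every candidate boundary before emitting a chunk, alongside the keyframe alignment described above.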
Delta Encoding
When transferring data, Dits uses delta encoding to minimize bandwidth:
// Chunk set comparison for sync
struct SyncRequest {
    // Chunks the client has
    have: HashSet<[u8; 32]>,
    // Chunks the client wants
    want: HashSet<[u8; 32]>,
}

struct SyncResponse {
    // Chunks to transfer
    chunks: Vec<ChunkData>,
    // Chunks the client already has (no transfer needed)
    already_have: Vec<[u8; 32]>,
}

fn compute_delta(
    client_have: &HashSet<[u8; 32]>,
    server_have: &HashSet<[u8; 32]>,
    needed: &HashSet<[u8; 32]>,
) -> Vec<[u8; 32]> {
    needed
        .iter()
        .filter(|hash| !client_have.contains(*hash))
        .filter(|hash| server_have.contains(*hash))
        .copied()
        .collect()
}
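For illustration, here is how compute_delta behaves on toy inputs (the byte-filled arrays stand in for real 32-byte chunk hashes): only chunks that the client lacks and the server can supply are scheduled for transfer.
use std::collections::HashSet;

fn main() {
    // Toy stand-ins for 32-byte chunk hashes.
    let h = |b: u8| -> [u8; 32] { [b; 32] };
    let client_have = HashSet::from([h(1), h(2)]);
    let server_have = HashSet::from([h(2), h(3), h(4)]);
    let needed = HashSet::from([h(2), h(3)]);
    let to_send = compute_delta(&client_have, &server_have, &needed);
    // h(2) is already on the client, so only h(3) is transferred.
    assert_eq!(to_send, vec![h(3)]);
}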
Merkle Tree Verification
Dits uses Merkle trees for efficient verification of large datasets:
fn build_merkle_tree(chunks: &[[u8; 32]]) -> [u8; 32] {
    if chunks.is_empty() {
        return [0u8; 32];
    }
    if chunks.len() == 1 {
        return chunks[0];
    }
    let mut level = chunks.to_vec();
    while level.len() > 1 {
        let mut next_level = Vec::new();
        for pair in level.chunks(2) {
            let hash = if pair.len() == 2 {
                hash_pair(pair[0], pair[1])
            } else {
                // Odd node out: promote it unchanged to the next level
                pair[0]
            };
            next_level.push(hash);
        }
        level = next_level;
    }
    level[0]
}

// Verify a single chunk against the root using its Merkle proof
fn verify_chunk_proof(
    chunk_hash: [u8; 32],
    proof: &[ProofNode],
    root: [u8; 32],
) -> bool {
    let mut current = chunk_hash;
    for node in proof {
        current = match node {
            ProofNode::Left(sibling) => hash_pair(*sibling, current),
            ProofNode::Right(sibling) => hash_pair(current, *sibling),
        };
    }
    current == root
}
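A short usage sketch follows. hash_pair and ProofNode are referenced above but not defined on this page, so minimal assumed versions (BLAKE3-based pairing, Left/Right sibling markers) are included here purely for illustration.
// Assumed helpers for the sketch; Dits' real definitions may differ.
enum ProofNode {
    Left([u8; 32]),  // sibling hash sits to the left of the current node
    Right([u8; 32]), // sibling hash sits to the right
}

fn hash_pair(left: [u8; 32], right: [u8; 32]) -> [u8; 32] {
    let mut hasher = blake3::Hasher::new();
    hasher.update(&left);
    hasher.update(&right);
    *hasher.finalize().as_bytes()
}

fn main() {
    // Four chunk hashes (toy values)
    let leaves: Vec<[u8; 32]> = (0u8..4).map(|i| *blake3::hash(&[i]).as_bytes()).collect();
    // Root = H(H(L0, L1), H(L2, L3))
    let n01 = hash_pair(leaves[0], leaves[1]);
    let n23 = hash_pair(leaves[2], leaves[3]);
    let root = hash_pair(n01, n23);
    // Proof for leaf 2: its sibling L3 (to the right), then node(0,1) (to the left)
    let proof = vec![ProofNode::Right(leaves[3]), ProofNode::Left(n01)];
    assert!(verify_chunk_proof(leaves[2], &proof, root));
    assert_eq!(root, build_merkle_tree(&leaves));
}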
Compression Selection
Dits adaptively selects compression based on content type:
fn select_compression(data: &[u8], mime_type: &str) -> Compression {
    // Pre-compressed formats: don't compress again
    if is_compressed_format(mime_type) {
        return Compression::None;
    }
    // Test compressibility with a small sample
    let sample = &data[..data.len().min(65536)];
    if sample.is_empty() {
        return Compression::None;
    }
    let Ok(compressed) = zstd::encode_all(sample, 3) else {
        return Compression::None;
    };
    let ratio = compressed.len() as f64 / sample.len() as f64;
    if ratio > 0.95 {
        // Not compressible
        Compression::None
    } else if data.len() > 10_000_000 {
        // Large data: use faster compression
        Compression::Zstd { level: 3 }
    } else {
        // Small data: use better compression
        Compression::Zstd { level: 9 }
    }
}

fn is_compressed_format(mime_type: &str) -> bool {
    matches!(mime_type,
        "video/mp4" | "video/quicktime" |
        "video/x-matroska" |
        "image/jpeg" | "image/png" |
        "application/zip" | "application/gzip"
    )
}
Related Topics
- Data Structures - How algorithmic output is stored
- Network Protocol - How data is transferred
- Chunking & Deduplication - User-facing chunking concepts