Core Concepts
Understanding how Dits works will help you use it effectively. This page explains the key concepts behind Dits.
Content-Defined Chunking
Unlike Git, which stores each file as a single object, Dits splits files into variable-size chunks based on their content. This is called content-defined chunking (CDC).
Git:
File: video.mp4 (2GB)
├── Stored as single blob
└── Any change = re-store 2GB

Dits:
File: video.mp4 (2GB)
├── Chunk 1: 1.2 MB (hash: abc...)
├── Chunk 2: 0.9 MB (hash: def...)
├── ...
└── Only changed chunks stored
The chunking algorithm (FastCDC) uses a rolling hash to find chunk boundaries based on content, not fixed positions. This means:
- Insertions/deletions don't cascade: If you insert data in the middle of a file, only the chunks near the insertion point change
- Deduplication works across files: If two files share content (like different cuts of the same footage), they share chunks
- Efficient syncing: Only new/changed chunks need to be transferred
Chunking Algorithms
Dits implements multiple content-defined chunking algorithms, each optimized for different use cases:
FastCDC (Default)
FastCDC (Fast Content-Defined Chunking) is Dits' primary algorithm, providing excellent performance and deduplication ratios.
Additional Chunking Algorithms
Beyond FastCDC, Dits implements several specialized chunking algorithms for different performance and security requirements:
- Classic polynomial rolling hash algorithm
- Strong locality guarantees (identical content = identical boundaries)
- May produce more variable chunk sizes than FastCDC
- Best for: Applications requiring strict content-aware boundaries

- Places boundaries at local minima/maxima in sliding windows
- Better control over chunk size distribution
- Reduces extreme chunk size variance
- Best for: Consistent chunk sizes, lower metadata overhead

- Advanced layered algorithm with mathematical guarantees
- Provable strict bounds on both chunk size AND edit locality
- Uses hierarchical merging (balancing → caterpillar → diffbit phases)
- Best for: Mission-critical applications requiring guarantees

- Multi-core implementation of FastCDC
- Splits large files into segments processed in parallel
- 2-4x throughput improvement on multi-core systems
- Best for: Large files, high-throughput environments

- Security-enhanced FastCDC with secret key
- Prevents fingerprinting attacks via chunk length patterns
- Same performance as FastCDC with added privacy protection
- Best for: Encrypted backups, privacy-sensitive applications
How Chunk Boundaries Are Determined
Fixed-size chunking problem
Insert X → All chunks shift!
CDC solution
Insert X → Only first chunk changes
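The difference is easy to reproduce on a toy input. The sketch below is illustrative only and is not Dits code: `fixed_chunks` cuts every `size` bytes, while `content_chunks` cuts after every space byte, a deliberately simple stand-in for FastCDC's rolling-hash boundary test. The function names and the `reused` helper are hypothetical.

```rust
/// Fixed-size chunking: cut every `size` bytes regardless of content.
fn fixed_chunks(data: &[u8], size: usize) -> Vec<Vec<u8>> {
    data.chunks(size).map(|c| c.to_vec()).collect()
}

/// Toy content-defined chunking: cut after every space byte.
/// (Assumption: a stand-in for a real rolling-hash boundary test.)
fn content_chunks(data: &[u8]) -> Vec<Vec<u8>> {
    data.split_inclusive(|b| *b == b' ')
        .map(|c| c.to_vec())
        .collect()
}

/// Count how many chunks of `after` already exist among the chunks of `before`.
fn reused(before: &[Vec<u8>], after: &[Vec<u8>]) -> usize {
    after.iter().filter(|c| before.contains(*c)).count()
}
```

For the input `the quick brown fox` with an `X` inserted at the front, 4-byte fixed chunks share nothing with the original, while the content-defined split reuses three of its four chunks: only the first chunk is new.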
Algorithm Parameters
FastCDC uses carefully tuned parameters for optimal performance:
// FastCDC configuration for video files
min_size: 32KB // Minimum chunk size
avg_size: 64KB // Target average size
max_size: 256KB // Maximum chunk size
normalization: 2 // Size distribution control
Rolling Hash Implementation
FastCDC uses a "gear hash", a rolling hash that looks up each input byte in a precomputed table of 256 random 64-bit values:
// Rolling hash state
hash = 0
// For each byte in the file:
hash = (hash << 1) + gear_table[byte_value]
// Check if hash matches boundary pattern:
// (hash & mask) == 0 → create chunk boundary
Performance Characteristics
| Implementation | Throughput | Platform |
|---|---|---|
| Scalar (baseline) | 800 MB/s | All CPUs |
| SSE4.1 | 1.2 GB/s | Intel/AMD |
| AVX2 | 2.0 GB/s | Modern Intel/AMD |
| AVX-512 | 3.5 GB/s | High-end Intel |
| ARM NEON | 1.5-2.5 GB/s | Apple Silicon, ARM64 |
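A minimal scalar sketch of the gear-hash boundary search described under Rolling Hash Implementation, with several simplifying assumptions: the gear table is generated with SplitMix64 from a fixed seed (any fixed table of pseudo-random 64-bit values would do; the table Dits actually uses is not shown here), a single mask replaces FastCDC's normalized chunking, and the hash is reset at every boundary. `gear_table` and `chunk_ends` are hypothetical names.

```rust
/// Build a 256-entry gear table with the SplitMix64 generator.
/// (Assumption: illustrative values, not the table Dits uses.)
fn gear_table() -> [u64; 256] {
    let mut table = [0u64; 256];
    let mut state: u64 = 0x9E37_79B9_7F4A_7C15;
    for entry in table.iter_mut() {
        state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        *entry = z ^ (z >> 31);
    }
    table
}

/// Return the end offset of every chunk in `data`.
fn chunk_ends(data: &[u8], min_size: usize, max_size: usize, mask: u64) -> Vec<usize> {
    let gear = gear_table();
    let mut ends = Vec::new();
    let mut start = 0;
    let mut hash: u64 = 0;
    for (i, &byte) in data.iter().enumerate() {
        // hash = (hash << 1) + gear_table[byte_value], using wrapping arithmetic
        hash = (hash << 1).wrapping_add(gear[byte as usize]);
        let len = i + 1 - start;
        // Cut when the hash matches the pattern (past min_size) or at max_size.
        let at_boundary = len >= min_size && (hash & mask) == 0;
        if at_boundary || len >= max_size {
            ends.push(i + 1);
            start = i + 1;
            hash = 0;
        }
    }
    if start < data.len() {
        ends.push(data.len()); // final, possibly short, chunk
    }
    ends
}
```

With a mask that has k low bits set, a boundary is expected roughly every 2^k bytes past `min_size`, which is how the mask controls the average chunk size.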
Content Addressing
Every piece of data in Dits is identified by its content hash, specifically a BLAKE3 hash. This is called content addressing.
# Every chunk has a unique hash based on its content
Chunk abc123... = specific 1.2MB of video data
Chunk def456... = specific 0.9MB of video data
# Files are just lists of chunk hashes
video.mp4 = [abc123, def456, ghi789, ...]
# Commits reference file manifests by hash
Commit xyz... -> Manifest hash -> File hashes -> Chunk hashes
Benefits of content addressing:
- Automatic deduplication: Identical content always has the same hash, so it's only stored once
- Data integrity: If a chunk's hash doesn't match, you know it's corrupted
- Immutability: You can't modify stored data without changing its address
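These three properties can be seen in a small in-memory store. This is a sketch under stated assumptions, not Dits' storage layer: `ChunkStore`, `put`, and `verify` are hypothetical names, and the hash is the standard library's `DefaultHasher` (64-bit SipHash, not cryptographic), used only so the example needs no external crate. Dits itself uses BLAKE3.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Stand-in for a BLAKE3 digest. NOT cryptographic; for illustration only.
type ChunkId = u64;

fn chunk_id(data: &[u8]) -> ChunkId {
    let mut hasher = DefaultHasher::new();
    data.hash(&mut hasher);
    hasher.finish()
}

#[derive(Default)]
struct ChunkStore {
    chunks: HashMap<ChunkId, Vec<u8>>,
}

impl ChunkStore {
    /// Store a chunk under its content hash; identical content is stored once.
    fn put(&mut self, data: &[u8]) -> ChunkId {
        let id = chunk_id(data);
        self.chunks.entry(id).or_insert_with(|| data.to_vec());
        id
    }

    /// Integrity check: recompute the hash and compare it with the address.
    fn verify(&self, id: ChunkId) -> bool {
        self.chunks
            .get(&id)
            .map_or(false, |data| chunk_id(data) == id)
    }
}
```

Because the address is derived from the bytes, storing the same chunk twice returns the same id and adds no new entry, and any stored chunk whose bytes no longer hash to its id fails `verify`.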
Cryptographic Hashing
Dits supports multiple cryptographic hash algorithms for different performance and security trade-offs:
BLAKE3 (Default)
Dits uses BLAKE3 as the default hash algorithm for its exceptional performance and security:
| Property | SHA-256 | BLAKE3 |
|---|---|---|
| Speed | ~500 MB/s | 3-6 GB/s (multi-threaded) |
| Parallelism | Single-threaded | Multi-threaded |
| Security | Proven | Proven (BLAKE family) |
Alternative Hash Algorithms
SHA-256
- Industry standard cryptographic hash
- Widely trusted and analyzed
- ~2x slower than BLAKE3
- Best for: Regulatory compliance, maximum compatibility

SHA3-256
- Future-proof cryptographic construction
- Different algorithm family than SHA-2
- ~3x slower than BLAKE3
- Best for: Post-quantum security considerations
Hash Algorithm Selection
# Configure repository to use different hash algorithm
dits config core.hashAlgorithm sha256
# Available options: blake3, sha256, sha3-256
# Default: blake3 (recommended for performance)
All hash algorithms produce 256-bit (32-byte) outputs and provide cryptographic security guarantees.
Cryptographic Properties
- Collision resistance: Computationally infeasible to find two different inputs with the same hash
- Preimage resistance: Given a hash, computationally infeasible to find an input that produces it
- Second preimage resistance: Given input A, computationally infeasible to find a different input B with the same hash
Hybrid Storage Architecture
Dits uses a hybrid storage system that intelligently chooses the optimal storage method for different types of files. This combines the best of Git's text handling with Dits' binary optimizations.
Text files (Git-style storage)
- Source code: .rs, .js, .py, .cpp, etc.
- Config files: .json, .yaml, .toml, .xml
- Documentation: .md, .txt, .rst
- Benefits: Line-based diffs, 3-way merge, blame

Binary and media files (Dits chunked storage)
- Video: .mp4, .mov, .avi, .mkv
- 3D Models: .obj, .fbx, .gltf, .usd
- Game Assets: Unity, Unreal, Godot files
- Images: .psd, .raw, large .png/.jpg
- Benefits: FastCDC chunking, deduplication
The system automatically classifies files based on extension, content analysis, and filename patterns. This ensures optimal performance and features for each file type while maintaining Git compatibility.
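As a rough illustration, the extension-based part of that classification might look like the sketch below. It covers only the extensions listed above, ignores content analysis and filename patterns, and assumes that anything unrecognized goes to chunked storage; `StorageKind` and `classify` are hypothetical names, not Dits APIs.

```rust
use std::path::Path;

#[derive(Debug, PartialEq)]
enum StorageKind {
    GitText,     // line-based storage: diffs, 3-way merge, blame
    DitsChunked, // FastCDC chunking with deduplication
}

/// Classify a path by its extension only (case-insensitive).
fn classify(path: &str) -> StorageKind {
    let ext = Path::new(path)
        .extension()
        .and_then(|e| e.to_str())
        .map(|e| e.to_ascii_lowercase())
        .unwrap_or_default();
    match ext.as_str() {
        // Source code, config files, and documentation from the lists above.
        "rs" | "js" | "py" | "cpp" | "json" | "yaml" | "toml" | "xml" | "md" | "txt"
        | "rst" => StorageKind::GitText,
        // Assumption: everything else is treated as binary and chunked.
        _ => StorageKind::DitsChunked,
    }
}
```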
Best of Both Worlds
Working Alongside Git
Initialize both tools in the same project (git init then dits init) to get hybrid storage that automatically uses the best system for each file type.
Manifest System
The manifest is Dits' authoritative record of a commit's file tree. It describes how to reconstruct files from chunks and stores rich metadata.
What a Manifest Contains
Each manifest includes:
- All files in the repository at that commit
- File metadata (size, permissions, timestamps)
- Chunk references for reconstructing content
- Asset metadata (video dimensions, codec, duration)
- Directory structure for efficient browsing
- Dependency graphs for project files
Manifest Data Structure
pub struct ManifestPayload {
    pub version: u8,                           // Format version
    pub repo_id: Uuid,                         // Repository identifier
    pub commit_hash: [u8; 32],                 // This commit's hash
    pub parent_hash: Option<[u8; 32]>,         // Parent commit (for diffs)
    pub entries: Vec<ManifestEntry>,           // All files
    pub directories: Vec<DirectoryEntry>,      // Directory structure
    pub dependencies: Option<DependencyGraph>, // File relationships
    pub stats: ManifestStats,                  // Aggregate statistics
}
File Representation
Each file is represented as a manifest entry:
pub struct ManifestEntry {
    pub path: String,                          // Relative path
    pub size: u64,                             // File size in bytes
    pub content_hash: [u8; 32],                // Full file BLAKE3 hash
    pub chunks: Vec<ChunkRef>,                 // How to reconstruct file
    // Rich metadata
    pub metadata: FileMetadata,                // MIME type, encoding, etc.
    pub asset_metadata: Option<AssetMetadata>, // Video/audio specifics
}
Asset Metadata Extraction
For media files, Dits extracts comprehensive metadata:
pub struct AssetMetadata {
    pub asset_type: AssetType,        // Video, Audio, Image
    pub duration_ms: Option<u64>,     // Playback duration
    pub width: Option<u32>,           // Video width
    pub height: Option<u32>,          // Video height
    pub video_codec: Option<String>,  // "h264", "prores", etc.
    pub audio_codec: Option<String>,  // "aac", "pcm", etc.
    // Camera metadata
    pub camera_metadata: Option<CameraMetadata>,
    pub thumbnail: Option<[u8; 32]>,  // Thumbnail chunk hash
}
Repository Structure
A Dits repository is stored in a .dits directory with this structure:
.dits/
├── HEAD              # Current branch reference
├── config            # Repository configuration
├── index             # Staging area
├── objects/          # Content-addressed storage
│   ├── chunks/       # File chunks
│   ├── manifests/    # File manifests
│   └── commits/      # Commit objects
└── refs/
    ├── heads/        # Branch refs
    └── tags/         # Tag refs
Object Types
Chunk
The fundamental unit of storage. A variable-size piece of file content, typically 256KB to 4MB.
Manifest
Describes how to reconstruct a file from chunks. Contains the ordered list of chunk hashes, file metadata (size, permissions), and optional video metadata.
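Reconstruction from a manifest amounts to concatenating chunks in order and checking the result. The sketch below assumes a simplified `FileManifest` (only a size and an ordered hash list) and an in-memory `HashMap` as the chunk store; both are hypothetical stand-ins for the real `ManifestEntry` and object storage.

```rust
use std::collections::HashMap;

type ChunkHash = [u8; 32];

/// Simplified manifest: expected size plus the ordered chunk hashes.
struct FileManifest {
    size: u64,
    chunks: Vec<ChunkHash>,
}

/// Rebuild a file by concatenating its chunks in manifest order.
fn reconstruct(
    manifest: &FileManifest,
    store: &HashMap<ChunkHash, Vec<u8>>,
) -> Result<Vec<u8>, String> {
    let mut out = Vec::with_capacity(manifest.size as usize);
    for (index, hash) in manifest.chunks.iter().enumerate() {
        let chunk = store
            .get(hash)
            .ok_or_else(|| format!("chunk {} is missing from the store", index))?;
        out.extend_from_slice(chunk);
    }
    // The recorded size is a cheap sanity check on the rebuilt file.
    if out.len() as u64 != manifest.size {
        return Err(format!(
            "expected {} bytes, rebuilt {}",
            manifest.size,
            out.len()
        ));
    }
    Ok(out)
}
```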
Commit
A snapshot of the repository at a point in time. Contains:
- Tree (manifest) hash pointing to file state
- Parent commit hash(es)
- Author and committer information
- Commit message
- Timestamp
Branch
A mutable reference to a commit. The default branch is main. Branches make it easy to work on different versions simultaneously.
Tag
An immutable reference to a commit, typically used to mark releases or important versions.
Sync Protocol and Delta Efficiency
Dits uses a sophisticated sync protocol to minimize bandwidth usage.
Have/Want Protocol
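Instead of sending entire files, Dits negotiates what data is needed: the receiver compares the chunks it already has against the chunks it wants, and only the missing chunks are transferred.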
Delta Sync Efficiency
Traditional sync:
File changed → upload entire file
10 GB video, small edit → transfer 10 GB

Dits delta sync:
File changed → identify changed chunks
10 GB video, small edit → transfer ~50 MB
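At its core, the have/want negotiation is a set difference over chunk hashes. The following sketch assumes the receiver knows the remote file's chunk list and its own set of stored hashes; `want_list` is a hypothetical helper, not part of the Dits wire protocol.

```rust
use std::collections::HashSet;

type ChunkHash = [u8; 32];

/// Chunks the receiver must request: referenced remotely, not held locally.
/// Each missing hash is listed once, in first-seen order.
fn want_list(remote_chunks: &[ChunkHash], local_have: &HashSet<ChunkHash>) -> Vec<ChunkHash> {
    let mut seen = HashSet::new();
    remote_chunks
        .iter()
        .filter(|h| !local_have.contains(*h) && seen.insert(**h))
        .copied()
        .collect()
}
```

Only the chunks in the returned list cross the network, which is why a small edit to a large file transfers a small fraction of its size.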
Performance Characteristics
Download Performance Optimizations
Dits implements multiple optimizations to maximize download speeds and utilize full network capacity:
- Problem: Memory-bound chunking
- Solution: 64KB sliding window
- Result: Process any file size
- Memory: 99.9% reduction vs buffered

- Multi-core chunking: 3-4x speedup
- Parallel downloads: Aggregate bandwidth
- Concurrent transfers: 1000+ streams
- Zero-copy I/O: 50-70% less CPU
High-Throughput QUIC Transport
- Concurrent streams: 1000+ parallel transfers
- Large flow windows: 16MB buffers for high bandwidth
- Connection pooling: Reuse connections, eliminate handshakes
- BBR congestion control: Optimized for modern networks
Adaptive Chunk Sizing
| Network Type | Optimal Chunk Size | Strategy |
|---|---|---|
| LAN (>1Gbps) | 8MB | Maximum throughput |
| Fast broadband (100Mbps) | 2MB | Balanced performance |
| High latency (satellite) | 256KB | Responsiveness priority |
Maximum Speed Downloads
Throughput Benchmarks
| Operation | Performance | Notes |
|---|---|---|
| Streaming Chunking | Any file size | Bounded memory (64KB sliding window) |
| Parallel Chunking | 8+ GB/s | Multi-core processing |
| QUIC Transfer | 1+ GB/s | 1000+ concurrent streams |
| Multi-peer Download | N × peer bandwidth | Linear scaling with peers |
| Hashing (BLAKE3) | Up to 6 GB/s | Multi-threaded |
| File reconstruction | 500 MB/s | Sequential reads |
Video-Aware Features
Why Video-Aware Matters
For MP4/MOV files, Dits:
- Preserves container structure: The moov atom (metadata) is kept intact
- Aligns to keyframes: Chunk boundaries prefer I-frames for better deduplication of related footage
- Extracts metadata: Duration, resolution, and codec info are stored for quick access
Deduplication in Action
Consider this scenario:
# You have two versions of the same footage
scene01_take1.mp4 (10 GB, 10,000 chunks)
scene01_take2.mp4 (10 GB, 10,000 chunks)
# But 95% of the content is identical
# Dits stores:
- 10,000 unique chunks from take1
- 500 unique chunks from take2
- Total: 10,500 chunks (~10.5 GB) instead of 20 GB
# Deduplication savings: 47.5%
The more similar content you have, the greater the savings. This is especially powerful for:
- Multiple takes of the same scene
- Different cuts/edits of the same footage
- Footage from the same camera/location
- Projects that share B-roll or stock footage
Real-World Deduplication Scenarios
| Scenario | Raw Size | Deduplicated | Savings |
|---|---|---|---|
| 5 versions of video (minor edits) | 50 GB | 12 GB | 76% |
| 100 similar photos (same shoot) | 50 GB | 8 GB | 84% |
| 10 game builds (iterative) | 100 GB | 18 GB | 82% |
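The savings figures follow from one formula: savings = 1 − stored / raw. A small sketch that reproduces the numbers in this section (the function names are hypothetical):

```rust
/// Percentage of storage saved, given raw and deduplicated sizes in the same unit.
fn savings_percent(raw: f64, stored: f64) -> f64 {
    (1.0 - stored / raw) * 100.0
}

/// Same calculation from chunk counts, assuming chunks of similar size.
fn savings_from_chunks(total_chunks: u64, unique_chunks: u64) -> f64 {
    savings_percent(total_chunks as f64, unique_chunks as f64)
}
```

For the two takes above, 20,000 referenced chunks reduce to 10,500 stored chunks, so `savings_from_chunks(20_000, 10_500)` gives 47.5%. The table rows work the same way: 50 GB stored as 12 GB is a 76% saving.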
Security & Integrity
Content Addressing Security
Every piece of data is identified by its cryptographic hash:
Content → BLAKE3 hash → Storage
If content changes by even 1 bit:
→ Completely different hash
→ Stored as new content
→ Tampering is detectable
Verification Commands
$ dits fsck
Verifying repository integrity...
Checking objects... ✓
Checking references... ✓
Checking manifests... ✓
Verifying 45,678 chunks...
[████████████████████████████████] 100%
All chunks verified ✓
Repository is healthy.
Encryption Options
- In transit: All network transfers are encrypted with TLS 1.3, either directly or through QUIC (which uses TLS 1.3 for its handshake)
- At rest (optional): Files encrypted before storage
Comparison with Alternatives
Git LFS
Git Repository:         LFS Server:
┌─────────────┐         ┌─────────────┐
│ version 1   │ ──────▶ │ 10 GB file  │
│ (pointer)   │         ├─────────────┤
│ version 2   │ ──────▶ │ 10 GB file  │
│ (pointer)   │         │ (full copy) │
└─────────────┘         └─────────────┘
Total: 20 GB stored

Dits Repository:
┌─────────────────────────────────────┐
│ Manifest: video.mp4 = [A,B,C,D,E]   │
│ Chunks: A,B,C,D,E (10 GB total)     │
│                                     │
│ Version 2: video.mp4 = [A,B,C,F,G]  │
│ Chunks: A,B,C,F,G (only F,G new)    │
└─────────────────────────────────────┘
Total: ~10.2 GB stored
| Feature | Git LFS | Dits |
|---|---|---|
| Storage per version | Full copy | Changed chunks only |
| Diff capability | None | Chunk-level diff |
| Merge conflicts | Manual resolution | Explicit locking |
Virtual Filesystem (VFS)
Dits can mount a repository as a virtual drive using FUSE. Files appear instantly but are only "hydrated" (chunks downloaded) when accessed.
# Mount the repository
$ dits mount /mnt/project
# Files appear immediately
$ ls /mnt/project/footage/
scene01.mp4 scene02.mp4 scene03.mp4
# Opening a file triggers on-demand hydration
$ ffplay /mnt/project/footage/scene01.mp4
# Only accessed chunks are fetched
This is ideal for:
- Previewing large projects without full download
- NLE (editing software) integration
- Accessing specific files from a large repository
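On-demand hydration can be pictured as a read path that fetches a chunk the first time a read touches it and serves later reads from a local cache. The sketch below is a simplified model, not the Dits FUSE implementation: it assumes fixed-size chunks for easy offset arithmetic (Dits chunks are variable-size), and `LazyFile` and its `fetch` callback are hypothetical.

```rust
use std::collections::HashMap;

/// A file whose chunks are fetched only when a read touches them.
struct LazyFile<F: FnMut(usize) -> Vec<u8>> {
    chunk_size: usize,
    fetch: F,                       // hypothetical: downloads chunk `index`
    cache: HashMap<usize, Vec<u8>>, // chunks hydrated so far
}

impl<F: FnMut(usize) -> Vec<u8>> LazyFile<F> {
    fn new(chunk_size: usize, fetch: F) -> Self {
        LazyFile { chunk_size, fetch, cache: HashMap::new() }
    }

    /// Read `len` bytes at `offset`, hydrating only the overlapping chunks.
    fn read(&mut self, offset: usize, len: usize) -> Vec<u8> {
        let mut out = Vec::with_capacity(len);
        let mut pos = offset;
        let end = offset + len;
        while pos < end {
            let index = pos / self.chunk_size;
            let fetch = &mut self.fetch;
            // Fetch on first access, reuse the cached copy afterwards.
            let chunk = self.cache.entry(index).or_insert_with(|| fetch(index));
            let start = pos % self.chunk_size;
            if start >= chunk.len() {
                break; // read past the end of the file
            }
            let take = (end - pos).min(chunk.len() - start);
            out.extend_from_slice(&chunk[start..start + take]);
            pos += take;
        }
        out
    }

    /// Number of chunks downloaded so far.
    fn hydrated_chunks(&self) -> usize {
        self.cache.len()
    }
}
```

A read that spans two chunks hydrates exactly those two; a later read inside the same range triggers no further fetches.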
Next Steps
- Learn about Chunking in Detail
- Understand Branching & Merging
- Explore Video Features