Core Concepts
Understanding how Dits works will help you use it effectively. This page explains the key concepts behind Dits.
Content-Defined Chunking
Unlike Git, which stores each file as a single object, Dits splits files into variable-size chunks based on their content. This is called content-defined chunking (CDC).
The chunking algorithm (FastCDC) uses a rolling hash to find chunk boundaries based on content, not fixed positions. This means:
- Insertions/deletions don't cascade: If you insert data in the middle of a file, only the chunks near the insertion point change
- Deduplication works across files: If two files share content (like different cuts of the same footage), they share chunks
- Efficient syncing: Only new/changed chunks need to be transferred
Chunking Algorithms
Dits implements multiple content-defined chunking algorithms, each optimized for different use cases:
FastCDC (Default)
FastCDC (Fast Content-Defined Chunking) is Dits' primary algorithm, providing excellent performance and deduplication ratios.
Additional Chunking Algorithms
Beyond FastCDC, Dits implements several specialized chunking algorithms for different performance and security requirements.
How Chunk Boundaries Are Determined
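Fixed-size chunking problem: inserting X near the start of a file shifts every later boundary, so all chunks change.
CDC solution: boundaries follow the content, so inserting X changes only the chunk that contains the insertion (the first chunk in this example).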
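The difference between the two strategies can be made concrete with a small gear-hash chunker. What follows is a minimal sketch, not Dits' actual implementation: it uses plain gear-based CDC without FastCDC's normalized chunking, and the names `ChunkerConfig`, `gear_table`, and `split_chunks`, the SplitMix64-seeded table, and the choice of masking the top hash bits are all assumptions made for illustration.

```rust
/// Chunking parameters in bytes (hypothetical names for this sketch).
pub struct ChunkerConfig {
    pub min_size: usize,
    pub avg_size: usize, // must be a power of two, at least 2
    pub max_size: usize,
}

/// Build a 256-entry gear table: one pseudo-random 64-bit value per byte value.
/// SplitMix64 with a fixed seed is an assumption; any fixed random table works.
fn gear_table() -> [u64; 256] {
    let mut table = [0u64; 256];
    let mut state: u64 = 0;
    for entry in table.iter_mut() {
        state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        *entry = z ^ (z >> 31);
    }
    table
}

/// Split `data` into content-defined chunks.
pub fn split_chunks<'a>(data: &'a [u8], cfg: &ChunkerConfig) -> Vec<&'a [u8]> {
    assert!(cfg.avg_size >= 2 && cfg.avg_size.is_power_of_two());
    assert!(cfg.min_size >= 1 && cfg.min_size <= cfg.max_size);
    let gear = gear_table();
    // A boundary requires log2(avg_size) hash bits to be zero. The top bits
    // are used because they depend on the most history (up to 64 bytes).
    let bits = cfg.avg_size.trailing_zeros();
    let mask: u64 = !0u64 << (64 - bits);

    let mut chunks = Vec::new();
    let mut start = 0usize;
    let mut hash: u64 = 0;
    for (i, &byte) in data.iter().enumerate() {
        // Gear update: shift out old history, mix in the new byte.
        hash = (hash << 1).wrapping_add(gear[byte as usize]);
        let len = i + 1 - start;
        let at_boundary = len >= cfg.min_size && (hash & mask) == 0;
        if at_boundary || len >= cfg.max_size {
            chunks.push(&data[start..=i]);
            start = i + 1;
            hash = 0;
        }
    }
    if start < data.len() {
        chunks.push(&data[start..]); // trailing partial chunk
    }
    chunks
}
```

Chunking a buffer, inserting one byte near the start, and chunking again should leave most chunks byte-for-byte identical, because each boundary decision depends only on the last 64 bytes of content (assuming `min_size` is at least 64). A fixed-size splitter would instead change every chunk after the insertion.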
Algorithm Parameters
FastCDC uses carefully tuned parameters for optimal performance:
// FastCDC configuration for video files
min_size: 32KB // Minimum chunk size
avg_size: 64KB // Target average size
max_size: 256KB // Maximum chunk size
normalization: 2 // Size distribution control
Rolling Hash Implementation
FastCDC uses a "gear hash": a rolling hash driven by a precomputed table of 256 random 64-bit values, one per possible byte value:
// Rolling hash state
hash = 0
// For each byte in the file:
hash = (hash << 1) + gear_table[byte_value]
// Check if hash matches boundary pattern:
// (hash & mask) == 0 → create chunk boundary
Performance Characteristics
| Implementation | Throughput | Platform |
|---|---|---|
| Scalar (baseline) | 800 MB/s | All CPUs |
| SSE4.1 | 1.2 GB/s | Intel/AMD |
| AVX2 | 2.0 GB/s | Modern Intel/AMD |
| AVX-512 | 3.5 GB/s | High-end Intel |
| ARM NEON | 1.5-2.5 GB/s | Apple Silicon, ARM64 |
Content Addressing
Every piece of data in Dits is identified by the cryptographic hash of its content (BLAKE3 by default). This is called content addressing.
# Every chunk has a unique hash based on its content
Chunk abc123... = specific 1.2MB of video data
Chunk def456... = specific 0.9MB of video data
# Files are just lists of chunk hashes
video.mp4 = [abc123, def456, ghi789, ...]
# Commits reference file manifests by hash
Commit xyz... -> Manifest hash -> File hashes -> Chunk hashes
Benefits of content addressing:
- Automatic deduplication: Identical content always has the same hash, so it's only stored once
- Data integrity: If a chunk's hash doesn't match, you know it's corrupted
- Immutability: You can't modify stored data without changing its address
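The three benefits above fall out of a very small data structure. The sketch below is illustrative only: `ChunkStore`, `address_of`, and `get_verified` are hypothetical names, and the standard library's 64-bit `DefaultHasher` stands in for BLAKE3 purely to keep the example dependency-free. It is not cryptographic and must not be used for real content addressing.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Stand-in address type. Dits uses 32-byte digests; this sketch uses a
/// 64-bit non-cryptographic hash so it needs no external crates.
pub type Address = u64;

/// Derive an address from content alone.
pub fn address_of(content: &[u8]) -> Address {
    let mut hasher = DefaultHasher::new();
    content.hash(&mut hasher);
    hasher.finish()
}

#[derive(Default)]
pub struct ChunkStore {
    chunks: HashMap<Address, Vec<u8>>,
}

impl ChunkStore {
    /// Store a chunk and return its address. Identical content maps to the
    /// same address, so it is stored only once (automatic deduplication).
    pub fn put(&mut self, content: &[u8]) -> Address {
        let addr = address_of(content);
        self.chunks.entry(addr).or_insert_with(|| content.to_vec());
        addr
    }

    /// Fetch a chunk and re-hash it; a mismatch means the stored bytes were
    /// corrupted or tampered with (data integrity).
    pub fn get_verified(&self, addr: Address) -> Option<&[u8]> {
        let content = self.chunks.get(&addr)?;
        if address_of(content) == addr {
            Some(content.as_slice())
        } else {
            None
        }
    }

    /// Number of distinct chunks held.
    pub fn len(&self) -> usize {
        self.chunks.len()
    }
}
```

Putting the same bytes twice returns the same address and leaves `len()` unchanged. Immutability follows from the design: there is no way to change a chunk's bytes without also changing the address it is looked up by.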
Cryptographic Hashing
Dits supports multiple cryptographic hash algorithms for different performance and security trade-offs:
BLAKE3 (Default)
Dits uses BLAKE3 as the default hash algorithm for its exceptional performance and security:
| Property | SHA-256 | BLAKE3 |
|---|---|---|
| Speed | ~500 MB/s | 3-6 GB/s (multi-threaded) |
| Parallelism | Single-threaded | Multi-threaded |
| Security | Proven | Proven (BLAKE family) |
Alternative Hash Algorithms
Hash Algorithm Selection
# Configure repository to use different hash algorithm
dits config core.hashAlgorithm sha256
# Available options: blake3, sha256, sha3-256
# Default: blake3 (recommended for performance)
All hash algorithms produce 256-bit (32-byte) outputs and provide cryptographic security guarantees.
Cryptographic Properties
- Collision resistance: It is computationally infeasible to find two different inputs with the same hash
- Preimage resistance: Given a hash, it is computationally infeasible to find an input that produces it
- Second preimage resistance: Given input A, it is computationally infeasible to find a different input B with the same hash
Hybrid Storage Architecture
Dits uses a hybrid storage system that intelligently chooses the optimal storage method for different types of files. This combines the best of Git's text handling with Dits' binary optimizations.
The system automatically classifies files based on extension, content analysis, and filename patterns. This ensures optimal performance and features for each file type while maintaining Git compatibility.
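As a rough illustration of how such a classifier might combine those signals, here is a hedged sketch. Everything in it is an assumption: the `StorageKind` enum, the `classify` function, the extension lists, and the NUL-byte content check are invented for this example and do not describe Dits' real rules. Filename-pattern matching is omitted for brevity.

```rust
use std::path::Path;

/// Which backend stores a file (hypothetical names for this sketch).
#[derive(Debug, PartialEq, Eq)]
pub enum StorageKind {
    GitText,
    DitsChunked,
}

/// Classify a file from its path and the first bytes of its content.
pub fn classify(path: &str, head: &[u8]) -> StorageKind {
    // Illustrative extension lists only; a real system would be configurable.
    const BINARY_EXTS: [&str; 5] = ["mp4", "mov", "psd", "wav", "png"];
    const TEXT_EXTS: [&str; 5] = ["rs", "md", "txt", "json", "toml"];

    // Step 1: decide by extension when it is a known one.
    let ext = Path::new(path)
        .extension()
        .and_then(|e| e.to_str())
        .map(|e| e.to_ascii_lowercase());
    if let Some(e) = ext.as_deref() {
        if BINARY_EXTS.contains(&e) {
            return StorageKind::DitsChunked;
        }
        if TEXT_EXTS.contains(&e) {
            return StorageKind::GitText;
        }
    }

    // Step 2: fall back to content analysis. A NUL byte in the sampled
    // prefix is treated as a sign of binary data.
    if head.contains(&0) {
        StorageKind::DitsChunked
    } else {
        StorageKind::GitText
    }
}
```

With these assumed rules, `classify("footage/scene01.MP4", b"")` selects chunked storage by extension, while an extensionless file is decided by its content.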
Initialize both systems (git init, then dits init) to get hybrid storage that automatically uses the best system for each file type.
Manifest System
The manifest is Dits' authoritative record of a commit's file tree. It describes how to reconstruct files from chunks and stores rich metadata.
What a Manifest Contains
Each manifest includes:
- All files in the repository at that commit
- File metadata (size, permissions, timestamps)
- Chunk references for reconstructing content
- Asset metadata (video dimensions, codec, duration)
- Directory structure for efficient browsing
- Dependency graphs for project files
Manifest Data Structure
pub struct ManifestPayload {
    pub version: u8,                           // Format version
    pub repo_id: Uuid,                         // Repository identifier
    pub commit_hash: [u8; 32],                 // This commit's hash
    pub parent_hash: Option<[u8; 32]>,         // Parent commit (for diffs)
    pub entries: Vec<ManifestEntry>,           // All files
    pub directories: Vec<DirectoryEntry>,      // Directory structure
    pub dependencies: Option<DependencyGraph>, // File relationships
    pub stats: ManifestStats,                  // Aggregate statistics
}
File Representation
Each file is represented as a manifest entry:
pub struct ManifestEntry {
    pub path: String,                          // Relative path
    pub size: u64,                             // File size in bytes
    pub content_hash: [u8; 32],                // Full file BLAKE3 hash
    pub chunks: Vec<ChunkRef>,                 // How to reconstruct file
    // Rich metadata
    pub metadata: FileMetadata,                // MIME type, encoding, etc.
    pub asset_metadata: Option<AssetMetadata>, // Video/audio specifics
}
Asset Metadata Extraction
For media files, Dits extracts comprehensive metadata:
pub struct AssetMetadata {
    pub asset_type: AssetType,         // Video, Audio, Image
    pub duration_ms: Option<u64>,      // Playback duration
    pub width: Option<u32>,            // Video width
    pub height: Option<u32>,           // Video height
    pub video_codec: Option<String>,   // "h264", "prores", etc.
    pub audio_codec: Option<String>,   // "aac", "pcm", etc.
    // Camera metadata
    pub camera_metadata: Option<CameraMetadata>,
    pub thumbnail: Option<[u8; 32]>,   // Thumbnail chunk hash
}
Repository Structure
A Dits repository is stored in a .dits directory with this structure:
.dits/
├── HEAD              # Current branch reference
├── config            # Repository configuration
├── index             # Staging area
├── objects/          # Content-addressed storage
│   ├── chunks/       # File chunks
│   ├── manifests/    # File manifests
│   └── commits/      # Commit objects
└── refs/
    ├── heads/        # Branch refs
    └── tags/         # Tag refs
Object Types
Chunk
The fundamental unit of storage. A variable-size piece of file content, typically 256KB to 4MB.
Manifest
Describes how to reconstruct a file from chunks. Contains the ordered list of chunk hashes, file metadata (size, permissions), and optional video metadata.
Commit
A snapshot of the repository at a point in time. Contains:
- Tree (manifest) hash pointing to file state
- Parent commit hash(es)
- Author and committer information
- Commit message
- Timestamp
Branch
A mutable reference to a commit. The default branch is main. Branches make it easy to work on different versions simultaneously.
Tag
An immutable reference to a commit, typically used to mark releases or important versions.
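To show how these object types fit together, the sketch below rebuilds a file from the ordered chunk references in its manifest entry. It is a simplified illustration under stated assumptions: the source names `ChunkRef` but does not show its fields, so the `hash` and `length` fields here are hypothetical, as are `ReconstructError` and `reconstruct`. An in-memory map stands in for the objects/chunks/ directory.

```rust
use std::collections::HashMap;

pub type ChunkHash = [u8; 32];

/// Hypothetical shape of a chunk reference (fields are assumptions).
pub struct ChunkRef {
    pub hash: ChunkHash,
    pub length: u64,
}

#[derive(Debug, PartialEq)]
pub enum ReconstructError {
    MissingChunk(ChunkHash),
    LengthMismatch(ChunkHash),
}

/// Concatenate chunks in manifest order to rebuild the file's bytes.
pub fn reconstruct(
    chunks: &[ChunkRef],
    store: &HashMap<ChunkHash, Vec<u8>>,
) -> Result<Vec<u8>, ReconstructError> {
    let mut out = Vec::new();
    for chunk_ref in chunks {
        let bytes = store
            .get(&chunk_ref.hash)
            .ok_or(ReconstructError::MissingChunk(chunk_ref.hash))?;
        // A cheap sanity check before the bytes are appended.
        if bytes.len() as u64 != chunk_ref.length {
            return Err(ReconstructError::LengthMismatch(chunk_ref.hash));
        }
        out.extend_from_slice(bytes);
    }
    Ok(out)
}
```

Because a manifest lists references rather than data, the same chunk can appear several times in one file, or in many files, while being stored once.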
Sync Protocol and Delta Efficiency
Dits' sync protocol minimizes bandwidth by transferring only the chunks the receiving side does not already have.
Have/Want Protocol
Instead of sending entire files, Dits negotiates what data is needed:
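At its core, this kind of negotiation is a set difference over chunk hashes. The sketch below shows only that core idea; the function name `compute_wants` and the `ChunkHash` alias are hypothetical, and the real wire format, batching, and message exchange are not described in the source.

```rust
use std::collections::HashSet;

pub type ChunkHash = [u8; 32];

/// Receiver side: given the chunk hashes a manifest requires and the hashes
/// already stored locally ("have"), return the hashes to request ("want").
/// The result keeps manifest order and lists each missing hash once.
pub fn compute_wants(needed: &[ChunkHash], have: &HashSet<ChunkHash>) -> Vec<ChunkHash> {
    let mut already_requested = HashSet::new();
    needed
        .iter()
        .filter(|hash| !have.contains(*hash) && already_requested.insert(**hash))
        .copied()
        .collect()
}
```

If a new commit reuses most chunks from a previous one, the want list contains only the handful of new hashes, and only those chunks cross the network.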
Delta Sync Efficiency
Performance Characteristics
Download Performance Optimizations
Dits implements multiple optimizations to maximize download speeds and utilize full network capacity:
High-Throughput QUIC Transport
- Concurrent streams: 1000+ parallel transfers
- Large flow windows: 16MB buffers for high bandwidth
- Connection pooling: Reuse connections, eliminate handshakes
- BBR congestion control: Optimized for modern networks
Adaptive Chunk Sizing
| Network Type | Optimal Chunk Size | Strategy |
|---|---|---|
| LAN (>1Gbps) | 8MB | Maximum throughput |
| Fast broadband (100Mbps) | 2MB | Balanced performance |
| High latency (satellite) | 256KB | Responsiveness priority |
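The table can be read as a simple selection rule. The sketch below encodes it under loudly stated assumptions: `LinkEstimate` and `transfer_chunk_size` are hypothetical names, the 500 ms round-trip threshold for "high latency" is an illustrative guess, and links the table does not cover fall back to the 2MB "balanced" size.

```rust
const KIB: u64 = 1024;
const MIB: u64 = 1024 * KIB;

/// Measured or estimated link properties (hypothetical type).
pub struct LinkEstimate {
    pub bandwidth_mbps: f64,
    pub rtt_ms: f64,
}

/// Pick a transfer chunk size following the table's three rows.
pub fn transfer_chunk_size(link: &LinkEstimate) -> u64 {
    if link.rtt_ms >= 500.0 {
        // High latency (assumed threshold): favor responsiveness.
        256 * KIB
    } else if link.bandwidth_mbps > 1000.0 {
        // LAN faster than 1 Gbps: favor maximum throughput.
        8 * MIB
    } else {
        // Everything else: balanced default (an assumption for uncovered links).
        2 * MIB
    }
}
```

Checking latency first means a fast but very distant link still gets small chunks, which is one possible reading of "responsiveness priority".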
Throughput Benchmarks
| Operation | Performance | Notes |
|---|---|---|
| Streaming Chunking | Unlimited | No memory limits |
| Parallel Chunking | 8+ GB/s | Multi-core processing |
| QUIC Transfer | 1+ GB/s | 1000+ concurrent streams |
| Multi-peer Download | N × peer bandwidth | Linear scaling with peers |
| Hashing (BLAKE3) | 6 GB/s | Multi-threaded |
| File reconstruction | 500 MB/s | Sequential reads |
Video-Aware Features
For MP4/MOV files, Dits:
- Preserves container structure: The moov atom (metadata) is kept intact
- Aligns to keyframes: Chunk boundaries prefer I-frames for better deduplication of related footage
- Extracts metadata: Duration, resolution, and codec info are stored for quick access
Deduplication in Action
Consider this scenario:
# You have two versions of the same footage
scene01_take1.mp4 (10 GB, 10,000 chunks)
scene01_take2.mp4 (10 GB, 10,000 chunks)
# But 95% of the content is identical
# Dits stores:
- 10,000 unique chunks from take1
- 500 unique chunks from take2
- Total: 10,500 chunks (~10.5 GB) instead of 20 GB
# Deduplication savings: 47.5%
The more similar content you have, the greater the savings. This is especially powerful for:
- Multiple takes of the same scene
- Different cuts/edits of the same footage
- Footage from the same camera/location
- Projects that share B-roll or stock footage
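The savings figure in the two-take example is just the fraction of chunk references that do not need their own stored chunk. A minimal sketch of that arithmetic follows; it assumes equal-size chunks, represents each chunk by a `u64` identifier instead of a real hash, and uses the hypothetical function name `dedup_savings_percent`.

```rust
use std::collections::HashSet;

/// Percentage of storage saved by storing each distinct chunk once.
/// Each file is a list of chunk identifiers; chunks are assumed equal-size.
pub fn dedup_savings_percent(files: &[Vec<u64>]) -> f64 {
    let total: usize = files.iter().map(|file| file.len()).sum();
    if total == 0 {
        return 0.0;
    }
    let unique: HashSet<u64> = files.iter().flatten().copied().collect();
    (1.0 - unique.len() as f64 / total as f64) * 100.0
}
```

For two files of 10,000 chunks each that share 9,500 chunks, there are 20,000 references and 10,500 distinct chunks, so the function returns approximately 47.5, matching the example above.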
Real-World Deduplication Scenarios
| Scenario | Raw Size | Deduplicated | Savings |
|---|---|---|---|
| 5 versions of video (minor edits) | 50 GB | 12 GB | 76% |
| 100 similar photos (same shoot) | 50 GB | 8 GB | 84% |
| 10 game builds (iterative) | 100 GB | 18 GB | 82% |
Security & Integrity
Content Addressing Security
Every piece of data is identified by its cryptographic hash:
Content → BLAKE3 hash → Storage
If content changes by even 1 bit:
→ Completely different hash
→ Stored as new content
→ Tampering is detectable
Verification Commands
$ dits fsck
Verifying repository integrity...
Checking objects... ✓
Checking references... ✓
Checking manifests... ✓
Verifying 45,678 chunks...
[████████████████████████████████] 100%
All chunks verified ✓
Repository is healthy.
Encryption Options
- In transit: All network transfers use TLS 1.3, either directly or through QUIC (which uses TLS 1.3 for its handshake)
- At rest (optional): Files encrypted before storage
Comparison with Alternatives
Git LFS
| Feature | Git LFS | Dits |
|---|---|---|
| Storage per version | Full copy | Changed chunks only |
| Diff capability | None | Chunk-level diff |
| Merge conflicts | Manual resolution | Explicit locking |
Virtual Filesystem (VFS)
Dits can mount a repository as a virtual drive using FUSE. Files appear instantly but are only "hydrated" (chunks downloaded) when accessed.
# Mount the repository
$ dits mount /mnt/project
# Files appear immediately
$ ls /mnt/project/footage/
scene01.mp4 scene02.mp4 scene03.mp4
# Opening a file triggers on-demand hydration
$ ffplay /mnt/project/footage/scene01.mp4
# Only accessed chunks are fetched
This is ideal for:
- Previewing large projects without full download
- NLE (editing software) integration
- Accessing specific files from a large repository
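On-demand hydration comes down to mapping a byte range requested by the reading application onto the chunks that cover it. The sketch below illustrates that mapping only; `chunks_for_read` is a hypothetical helper, and the FUSE integration, caching, and network fetch are out of scope.

```rust
use std::ops::Range;

/// Return the indices of the chunks that overlap the byte range `read`.
/// `chunk_lengths` lists each chunk's length in file order.
pub fn chunks_for_read(chunk_lengths: &[u64], read: Range<u64>) -> Vec<usize> {
    let mut needed = Vec::new();
    if read.start >= read.end {
        return needed; // empty read touches no chunks
    }
    let mut offset = 0u64;
    for (index, &len) in chunk_lengths.iter().enumerate() {
        let end = offset + len;
        // Half-open intervals [offset, end) and [read.start, read.end) overlap.
        if offset < read.end && end > read.start {
            needed.push(index);
        }
        offset = end;
    }
    needed
}
```

A player seeking to the middle of a large file would therefore trigger downloads only for the chunks under the seek position, not for the whole file.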
Next Steps
- Learn about Chunking in Detail
- Understand Branching & Merging
- Explore Video Features