Core Concepts

Understanding how Dits works will help you use it effectively. This page explains the key concepts behind Dits.

Content-Defined Chunking

Unlike Git, which stores each version of a file as a single object, Dits splits files into variable-size chunks based on their content. This is called content-defined chunking (CDC).

Traditional Approach
File: video.mp4 (2GB)
├── Stored as single blob
└── Any change = re-store 2GB
Dits Approach
File: video.mp4 (2GB)
├── Chunk 1: 1.2 MB (hash: abc...)
├── Chunk 2: 0.9 MB (hash: def...)
├── ...
└── Only changed chunks stored

The chunking algorithm (FastCDC) uses a rolling hash to find chunk boundaries based on content, not fixed positions. This means:

  • Insertions/deletions don't cascade: If you insert data in the middle of a file, only the chunks near the insertion point change
  • Deduplication works across files: If two files share content (like different cuts of the same footage), they share chunks
  • Efficient syncing: Only new/changed chunks need to be transferred

Chunking Algorithms

Dits implements multiple content-defined chunking algorithms, each optimized for different use cases:

FastCDC (Default)

FastCDC (Fast Content-Defined Chunking) is Dits' primary algorithm, providing excellent performance and deduplication ratios.

Additional Chunking Algorithms

Beyond FastCDC, Dits implements several specialized chunking algorithms for different performance and security requirements:

Rabin Fingerprinting
  • Classic polynomial rolling hash algorithm
  • Strong locality guarantees (identical content = identical boundaries)
  • May produce more variable chunk sizes than FastCDC
  • Best for: Applications requiring strict content-aware boundaries
Asymmetric Extremum (AE)
  • Places boundaries at local minima/maxima in sliding windows
  • Better control over chunk size distribution
  • Reduces extreme chunk size variance
  • Best for: Consistent chunk sizes, lower metadata overhead
Chonkers Algorithm
  • Advanced layered algorithm with mathematical guarantees
  • Provable strict bounds on both chunk size AND edit locality
  • Uses hierarchical merging (balancing → caterpillar → diffbit phases)
  • Best for: Mission-critical applications requiring guarantees
Parallel FastCDC
  • Multi-core implementation of FastCDC
  • Splits large files into segments processed in parallel
  • 2-4x throughput improvement on multi-core systems
  • Best for: Large files, high-throughput environments
Keyed FastCDC (KCDC)
  • Security-enhanced FastCDC with secret key
  • Prevents fingerprinting attacks via chunk length patterns
  • Same performance as FastCDC with added privacy protection
  • Best for: Encrypted backups, privacy-sensitive applications

How Chunk Boundaries Are Determined

Fixed-size chunking problem

Insert X → All chunks shift!

Chunks: AAAA BBBB CCCC DDDD
ALL chunks changed! 0% reuse

CDC solution

Insert X → Only first chunk changes

Chunks: X AAA|A BBB|B CCC|C DDD|D
1 new chunk! 80%+ reuse
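
The fixed-size failure mode is easy to reproduce. The sketch below is illustrative only (the function names `fixed_chunks` and `reused_chunks` are hypothetical, not part of Dits): it splits a buffer at fixed offsets and counts how many chunks of an edited buffer already exist in the original.

```rust
use std::collections::HashSet;

/// Split `data` into fixed-size pieces (the last piece may be shorter).
/// Panics if `size` is zero, like `slice::chunks`.
pub fn fixed_chunks(data: &[u8], size: usize) -> Vec<&[u8]> {
    data.chunks(size).collect()
}

/// Count how many chunks of `edited` already exist among the chunks of `original`.
pub fn reused_chunks(original: &[u8], edited: &[u8], size: usize) -> usize {
    let known: HashSet<&[u8]> = fixed_chunks(original, size).into_iter().collect();
    fixed_chunks(edited, size)
        .into_iter()
        .filter(|chunk| known.contains(chunk))
        .count()
}
```

With 4-byte chunks, `reused_chunks(b"AAAABBBBCCCCDDDD", b"XAAAABBBBCCCCDDDD", 4)` finds no reusable chunk, because the single inserted byte shifts every later boundary. Appending the byte at the end instead leaves all four original chunks reusable.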

Algorithm Parameters

FastCDC uses carefully tuned parameters for optimal performance:

// FastCDC configuration for video files
min_size: 32KB     // Minimum chunk size
avg_size: 64KB     // Target average size
max_size: 256KB    // Maximum chunk size
normalization: 2   // Size distribution control

Rolling Hash Implementation

FastCDC uses a "gear hash" - a precomputed table of random 64-bit values:

// Rolling hash state
hash = 0

// For each byte in the file:
hash = (hash << 1) + gear_table[byte_value]

// Check if hash matches boundary pattern:
// (hash & mask) == 0 → create chunk boundary
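
To make the pseudocode concrete, here is a minimal sketch of a gear-hash chunker that honors min/avg/max sizes. It is not the Dits implementation: `ChunkerConfig`, `gear_table`, and `cdc_chunks` are hypothetical names, the gear table is generated from a fixed seed rather than shipped as a constant, the mask simply tests the top bits of the hash, and the normalization parameter is omitted.

```rust
/// Hypothetical configuration mirroring the min/avg/max parameters above.
pub struct ChunkerConfig {
    pub min_size: usize,
    pub avg_size: usize, // assumed to be a power of two, at least 2
    pub max_size: usize,
}

/// Fill a 256-entry gear table with pseudo-random values (SplitMix64, fixed seed).
/// Illustrative only: the table Dits actually uses is not shown in this document.
pub fn gear_table() -> [u64; 256] {
    let mut table = [0u64; 256];
    let mut state: u64 = 0x9E37_79B9_7F4A_7C15;
    for entry in table.iter_mut() {
        state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        *entry = z ^ (z >> 31);
    }
    table
}

/// Split `data` into content-defined chunks.
pub fn cdc_chunks<'a>(data: &'a [u8], cfg: &ChunkerConfig) -> Vec<&'a [u8]> {
    assert!(cfg.avg_size.is_power_of_two() && cfg.avg_size >= 2 && cfg.max_size > 0);
    let gear = gear_table();
    // Test the top log2(avg_size) bits of the hash; they depend on the widest byte window.
    let bits = cfg.avg_size.trailing_zeros();
    let mask: u64 = !0u64 << (64 - bits);
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < data.len() {
        let limit = (data.len() - start).min(cfg.max_size);
        let mut len = limit; // forced cut at max_size (or end of data) if no boundary is found
        let mut hash: u64 = 0;
        // Bytes before min_size are skipped, so no chunk is cut shorter than min_size.
        for i in cfg.min_size.min(limit)..limit {
            hash = (hash << 1).wrapping_add(gear[data[start + i] as usize]);
            if hash & mask == 0 {
                len = i + 1;
                break;
            }
        }
        chunks.push(&data[start..start + len]);
        start += len;
    }
    chunks
}
```

Because each left shift pushes older bytes out of the 64-bit state, the hash depends only on the most recent 64 bytes. That is what lets boundaries resynchronize shortly after an insertion instead of shifting for the rest of the file.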

Performance Characteristics

Implementation    | Throughput   | Platform
Scalar (baseline) | 800 MB/s     | All CPUs
SSE4.1            | 1.2 GB/s     | Intel/AMD
AVX2              | 2.0 GB/s     | Modern Intel/AMD
AVX-512           | 3.5 GB/s     | High-end Intel
ARM NEON          | 1.5-2.5 GB/s | Apple Silicon, ARM64

Content Addressing

Every piece of data in Dits is identified by its content hash, specifically a BLAKE3 hash. This is called content addressing.

# Every chunk has a unique hash based on its content
Chunk abc123... = specific 1.2MB of video data
Chunk def456... = specific 0.9MB of video data

# Files are just lists of chunk hashes
video.mp4 = [abc123, def456, ghi789, ...]

# Commits reference file manifests by hash
Commit xyz... -> Manifest hash -> File hashes -> Chunk hashes

Benefits of content addressing:

  • Automatic deduplication: Identical content always has the same hash, so it's only stored once
  • Data integrity: If a chunk's hash doesn't match, you know it's corrupted
  • Immutability: You can't modify stored data without changing its address
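
The deduplication and integrity properties can be sketched with a small in-memory store. This is an assumption-laden illustration, not Dits' storage layer: `ChunkStore` and `content_id` are hypothetical names, and the standard library's `DefaultHasher` stands in for BLAKE3 purely to keep the example dependency-free. It is not a cryptographic hash.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Stand-in content hash. Dits uses BLAKE3; `DefaultHasher` is NOT cryptographic
/// and appears here only so the sketch needs no external crates.
pub fn content_id(data: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    data.hash(&mut hasher);
    hasher.finish()
}

/// Hypothetical in-memory chunk store keyed by content hash.
#[derive(Default)]
pub struct ChunkStore {
    chunks: HashMap<u64, Vec<u8>>,
}

impl ChunkStore {
    /// Store `data` under its content hash; identical content is stored once.
    pub fn put(&mut self, data: &[u8]) -> u64 {
        let id = content_id(data);
        self.chunks.entry(id).or_insert_with(|| data.to_vec());
        id
    }

    /// Fetch a chunk and re-hash it, returning `None` if it is missing or corrupted.
    pub fn get_verified(&self, id: u64) -> Option<&[u8]> {
        let data = self.chunks.get(&id)?;
        if content_id(data) == id {
            Some(data.as_slice())
        } else {
            None
        }
    }

    /// Number of distinct chunks held.
    pub fn len(&self) -> usize {
        self.chunks.len()
    }
}
```

Calling `put` twice with the same bytes returns the same identifier and leaves `len()` unchanged, which is the automatic deduplication described above.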

Cryptographic Hashing

Dits supports multiple cryptographic hash algorithms for different performance and security trade-offs:

BLAKE3 (Default)

Dits uses BLAKE3 as the default hash algorithm for its exceptional performance and security:

Property    | SHA-256         | BLAKE3
Speed       | ~500 MB/s       | 3-6 GB/s (multi-threaded)
Parallelism | Single-threaded | Multi-threaded
Security    | Proven          | Proven (BLAKE family)

Alternative Hash Algorithms

SHA-256
  • Industry standard cryptographic hash
  • Widely trusted and analyzed
  • ~2x slower than BLAKE3
  • Best for: Regulatory compliance, maximum compatibility
SHA-3-256
  • Sponge construction (Keccak), standardized by NIST
  • Different algorithm family than SHA-2
  • ~3x slower than BLAKE3
  • Best for: Algorithm diversity, as a hedge against future weaknesses in SHA-2

Hash Algorithm Selection

# Configure repository to use different hash algorithm
dits config core.hashAlgorithm sha256

# Available options: blake3, sha256, sha3-256
# Default: blake3 (recommended for performance)

All hash algorithms produce 256-bit (32-byte) outputs and provide cryptographic security guarantees.
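
As a rough illustration of how the `core.hashAlgorithm` value could be validated, the sketch below parses the three documented option strings into an enum. The `HashAlgorithm` type and its methods are hypothetical. Dits' actual configuration types are not shown in this document.

```rust
use std::str::FromStr;

/// Hypothetical enum covering the documented `core.hashAlgorithm` values.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum HashAlgorithm {
    Blake3,
    Sha256,
    Sha3_256,
}

impl Default for HashAlgorithm {
    /// BLAKE3 is the documented default.
    fn default() -> Self {
        HashAlgorithm::Blake3
    }
}

impl FromStr for HashAlgorithm {
    type Err = String;

    /// Accept exactly the option strings listed above.
    fn from_str(value: &str) -> Result<Self, Self::Err> {
        match value {
            "blake3" => Ok(HashAlgorithm::Blake3),
            "sha256" => Ok(HashAlgorithm::Sha256),
            "sha3-256" => Ok(HashAlgorithm::Sha3_256),
            other => Err(format!("unknown hash algorithm: {}", other)),
        }
    }
}

impl HashAlgorithm {
    /// Every supported algorithm produces a 256-bit (32-byte) digest.
    pub const fn digest_len(self) -> usize {
        32
    }
}
```

Rejecting unknown strings at parse time keeps a typo in the configuration from silently falling back to a different algorithm.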

Cryptographic Properties

  • Collision resistance: Computationally infeasible to find two different inputs with the same hash
  • Preimage resistance: Given a hash, computationally infeasible to find an input that produces it
  • Second preimage resistance: Given input A, computationally infeasible to find a different input B with the same hash

Hybrid Storage Architecture

Dits uses a hybrid storage system that intelligently chooses the optimal storage method for different types of files. This combines the best of Git's text handling with Dits' binary optimizations.

Text Files (Git Storage)
  • Source code: .rs, .js, .py, .cpp, etc.
  • Config files: .json, .yaml, .toml, .xml
  • Documentation: .md, .txt, .rst
  • Benefits: Line-based diffs, 3-way merge, blame
Binary Assets (Dits Storage)
  • Video: .mp4, .mov, .avi, .mkv
  • 3D Models: .obj, .fbx, .gltf, .usd
  • Game Assets: Unity, Unreal, Godot files
  • Images: .psd, .raw, large .png/.jpg
  • Benefits: FastCDC chunking, deduplication

The system automatically classifies files based on extension, content analysis, and filename patterns. This ensures optimal performance and features for each file type while maintaining Git compatibility.
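
The extension-based part of this classification might look like the following sketch. It covers only the extensions listed above. The names `StorageKind` and `classify_by_extension` are hypothetical, and the content-analysis and filename-pattern steps are deliberately left out, so anything ambiguous (including .png/.jpg, which depend on size) is reported as undetermined.

```rust
use std::path::Path;

/// Hypothetical outcome of the extension check.
#[derive(Debug, PartialEq, Eq)]
pub enum StorageKind {
    GitText,
    DitsBinary,
    /// Needs further checks (size, content analysis, filename patterns).
    Undetermined,
}

/// Classify a path using only its extension, case-insensitively.
pub fn classify_by_extension(path: &Path) -> StorageKind {
    let ext = match path.extension().and_then(|e| e.to_str()) {
        Some(ext) => ext.to_ascii_lowercase(),
        None => return StorageKind::Undetermined,
    };
    match ext.as_str() {
        "rs" | "js" | "py" | "cpp" | "json" | "yaml" | "toml" | "xml" | "md" | "txt"
        | "rst" => StorageKind::GitText,
        "mp4" | "mov" | "avi" | "mkv" | "obj" | "fbx" | "gltf" | "usd" | "psd" | "raw" => {
            StorageKind::DitsBinary
        }
        _ => StorageKind::Undetermined,
    }
}
```

For example, `classify_by_extension(Path::new("footage/scene01.MP4"))` yields `StorageKind::DitsBinary`, while a file with no extension such as `Makefile` is left for the later checks.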

Manifest System

The manifest is Dits' authoritative record of a commit's file tree. It describes how to reconstruct files from chunks and stores rich metadata.

What a Manifest Contains

Each manifest includes:

  • All files in the repository at that commit
  • File metadata (size, permissions, timestamps)
  • Chunk references for reconstructing content
  • Asset metadata (video dimensions, codec, duration)
  • Directory structure for efficient browsing
  • Dependency graphs for project files

Manifest Data Structure

pub struct ManifestPayload {
    pub version: u8,                    // Format version
    pub repo_id: Uuid,                  // Repository identifier
    pub commit_hash: [u8; 32],          // This commit's hash
    pub parent_hash: Option<[u8; 32]>, // Parent commit (for diffs)

    pub entries: Vec<ManifestEntry>,    // All files
    pub directories: Vec<DirectoryEntry>, // Directory structure
    pub dependencies: Option<DependencyGraph>, // File relationships
    pub stats: ManifestStats,           // Aggregate statistics
}

File Representation

Each file is represented as a manifest entry:

pub struct ManifestEntry {
    pub path: String,                  // Relative path
    pub size: u64,                     // File size in bytes
    pub content_hash: [u8; 32],        // Full file BLAKE3 hash
    pub chunks: Vec<ChunkRef>,         // How to reconstruct file

    // Rich metadata
    pub metadata: FileMetadata,        // MIME type, encoding, etc.
    pub asset_metadata: Option<AssetMetadata>, // Video/audio specifics
}
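
Reconstructing a file from its manifest entry amounts to concatenating the referenced chunks in order and checking the result against the recorded size. The sketch below assumes a simplified `ChunkRef` with only a hash and a length (the real definition is not shown above) and uses a plain `HashMap` as a stand-in chunk store.

```rust
use std::collections::HashMap;

/// Hypothetical shape of a chunk reference; the real `ChunkRef` is not shown above.
pub struct ChunkRef {
    pub hash: [u8; 32],
    pub length: u64,
}

/// Rebuild file content by concatenating chunks in manifest order.
pub fn reconstruct(
    expected_size: u64,
    chunks: &[ChunkRef],
    store: &HashMap<[u8; 32], Vec<u8>>,
) -> Result<Vec<u8>, String> {
    let mut out = Vec::new();
    for (index, chunk) in chunks.iter().enumerate() {
        let data = store
            .get(&chunk.hash)
            .ok_or_else(|| format!("chunk {} is missing from the store", index))?;
        if data.len() as u64 != chunk.length {
            return Err(format!("chunk {} has an unexpected length", index));
        }
        out.extend_from_slice(data);
    }
    // The manifest entry records the full file size, so verify it at the end.
    if out.len() as u64 != expected_size {
        return Err("reconstructed size does not match the manifest entry".to_string());
    }
    Ok(out)
}
```

A fuller version would also hash the result and compare it with `content_hash`. That step is omitted here because it needs a BLAKE3 implementation.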

Asset Metadata Extraction

For media files, Dits extracts comprehensive metadata:

pub struct AssetMetadata {
    pub asset_type: AssetType,        // Video, Audio, Image
    pub duration_ms: Option<u64>,     // Playback duration
    pub width: Option<u32>,           // Video width
    pub height: Option<u32>,          // Video height
    pub video_codec: Option<String>,  // "h264", "prores", etc.
    pub audio_codec: Option<String>,  // "aac", "pcm", etc.

    // Camera metadata
    pub camera_metadata: Option<CameraMetadata>,
    pub thumbnail: Option<[u8; 32]>,  // Thumbnail chunk hash
}

Repository Structure

A Dits repository is stored in a .dits directory with this structure:

.dits/
├── HEAD              # Current branch reference
├── config            # Repository configuration
├── index             # Staging area
├── objects/          # Content-addressed storage
│   ├── chunks/       # File chunks
│   ├── manifests/    # File manifests
│   └── commits/      # Commit objects
└── refs/
    ├── heads/        # Branch refs
    └── tags/         # Tag refs

Object Types

Chunk

The fundamental unit of storage. A variable-size piece of file content, typically 256KB to 4MB.

Manifest

Describes how to reconstruct a file from chunks. Contains the ordered list of chunk hashes, file metadata (size, permissions), and optional video metadata.

Commit

A snapshot of the repository at a point in time. Contains:

  • Tree (manifest) hash pointing to file state
  • Parent commit hash(es)
  • Author and committer information
  • Commit message
  • Timestamp

Branch

A mutable reference to a commit. The default branch is main. Branches make it easy to work on different versions simultaneously.

Tag

An immutable reference to a commit, typically used to mark releases or important versions.

Sync Protocol and Delta Efficiency

Dits uses a sophisticated sync protocol to minimize bandwidth usage.

Have/Want Protocol

Instead of sending entire files, Dits negotiates what data is needed:

Local chunks: A B C D E F
Remote chunks: A B

1. Query what remote has
   "Do you have A, B, C, D, E, F?"
2. Remote responds with Bloom filter
   Have: A, B. Missing: C, D, E, F
3. Upload only missing chunks
   → Transfer C, D, E, F (not A, B!)
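
The negotiation reduces to a set difference between what the sender holds and what the receiver reports having. The sketch below swaps the Bloom filter for an exact `HashSet` to stay short, and uses string labels as hypothetical chunk identifiers in place of 32-byte hashes.

```rust
use std::collections::HashSet;

/// Hypothetical chunk identifier; Dits would use the 32-byte content hash.
pub type ChunkId = &'static str;

/// Return the chunks the remote is missing, in local order and without duplicates.
pub fn chunks_to_upload(local: &[ChunkId], remote_has: &HashSet<ChunkId>) -> Vec<ChunkId> {
    let mut seen = HashSet::new();
    local
        .iter()
        .copied()
        .filter(|id| !remote_has.contains(id) && seen.insert(*id))
        .collect()
}
```

With local chunks A through F and a remote that holds A and B, the function returns C, D, E, F. A Bloom filter can report false positives, so a protocol built on one needs some way to recover when a chunk was wrongly reported as present. How Dits handles that case is not covered here.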

Delta Sync Efficiency

Traditional sync

File changed → upload entire file

10 GB video, small edit → transfer 10 GB

Dits delta sync

File changed → identify changed chunks

10 GB video, small edit → transfer ~50 MB

Performance Characteristics

Download Performance Optimizations

Dits implements multiple optimizations to maximize download speeds and utilize full network capacity:

Streaming FastCDC
  • Problem: Memory-bound chunking
  • Solution: 64KB sliding window
  • Result: Process any file size
  • Memory: 99.9% reduction vs buffered
Parallel Processing
  • Multi-core chunking: 3-4x speedup
  • Parallel downloads: Aggregate bandwidth
  • Concurrent transfers: 1000+ streams
  • Zero-copy I/O: 50-70% less CPU

High-Throughput QUIC Transport

  • Concurrent streams: 1000+ parallel transfers
  • Large flow windows: 16MB buffers for high bandwidth
  • Connection pooling: Reuse connections, eliminate handshakes
  • BBR congestion control: Optimized for modern networks

Adaptive Chunk Sizing

Network Type             | Optimal Chunk Size | Strategy
LAN (>1Gbps)             | 8MB                | Maximum throughput
Fast broadband (100Mbps) | 2MB                | Balanced performance
High latency (satellite) | 256KB              | Responsiveness priority
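
Expressed as code, the table is a simple lookup. The sketch below is hypothetical: the `NetworkType` enum and `transfer_chunk_size` function are illustrative names, sizes are interpreted as binary units (1 MB = 1024 × 1024 bytes), and how Dits detects the network type is not described in this document.

```rust
/// Hypothetical network classes matching the rows of the table above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum NetworkType {
    Lan,
    FastBroadband,
    HighLatency,
}

/// Chunk size in bytes for each network class (binary units assumed).
pub const fn transfer_chunk_size(network: NetworkType) -> usize {
    const KIB: usize = 1024;
    const MIB: usize = 1024 * KIB;
    match network {
        NetworkType::Lan => 8 * MIB,
        NetworkType::FastBroadband => 2 * MIB,
        NetworkType::HighLatency => 256 * KIB,
    }
}
```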

Throughput Benchmarks

Operation           | Performance        | Notes
Streaming Chunking  | Unlimited          | No memory limits
Parallel Chunking   | 8+ GB/s            | Multi-core processing
QUIC Transfer       | 1+ GB/s            | 1000+ concurrent streams
Multi-peer Download | N × peer bandwidth | Linear scaling with peers
Hashing (BLAKE3)    | 6 GB/s             | Multi-threaded
File reconstruction | 500 MB/s           | Sequential reads

Video-Aware Features

For MP4/MOV files, Dits:

  • Preserves container structure: The moov atom (metadata) is kept intact
  • Aligns to keyframes: Chunk boundaries prefer I-frames for better deduplication of related footage
  • Extracts metadata: Duration, resolution, codec info is stored for quick access

Deduplication in Action

Consider this scenario:

# You have two versions of the same footage
scene01_take1.mp4  (10 GB, 10,000 chunks)
scene01_take2.mp4  (10 GB, 10,000 chunks)

# But 95% of the content is identical
# Dits stores:
- 10,000 unique chunks from take1
- 500 unique chunks from take2
- Total: 10,500 chunks (~10.5 GB) instead of 20 GB

# Deduplication savings: 47.5%

The more similar content you have, the greater the savings. This is especially powerful for:

  • Multiple takes of the same scene
  • Different cuts/edits of the same footage
  • Footage from the same camera/location
  • Projects that share B-roll or stock footage

Real-World Deduplication Scenarios

Scenario                          | Raw Size | Deduplicated | Savings
5 versions of video (minor edits) | 50 GB    | 12 GB        | 76%
100 similar photos (same shoot)   | 50 GB    | 8 GB         | 84%
10 game builds (iterative)        | 100 GB   | 18 GB        | 82%
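
The savings figures follow from one formula: savings = (1 − deduplicated / raw) × 100. The helper below is a hypothetical name used only to spell out the arithmetic.

```rust
/// Percentage of storage saved by deduplication.
/// `raw` and `deduplicated` may be in any unit, as long as both use the same one.
pub fn savings_percent(raw: f64, deduplicated: f64) -> f64 {
    (1.0 - deduplicated / raw) * 100.0
}
```

For the two-take scenario, `savings_percent(20_000.0, 10_500.0)` gives 47.5. For the first table row, `savings_percent(50.0, 12.0)` gives 76.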

Security & Integrity

Content Addressing Security

Every piece of data is identified by its cryptographic hash:

Content → BLAKE3 hash → Storage

If content changes by even 1 bit:
  → Completely different hash
  → Stored as new content
  → Tampering is detectable

Verification Commands

$ dits fsck
Verifying repository integrity...
Checking objects... ✓
Checking references... ✓
Checking manifests... ✓
Verifying 45,678 chunks...
  [████████████████████████████████] 100%
All chunks verified ✓
Repository is healthy.

Encryption Options

  • In transit: All network transfers use TLS 1.3 or QUIC
  • At rest (optional): Files encrypted before storage

Comparison with Alternatives

Git LFS

Git LFS
Git Repository:          LFS Server:
┌─────────────┐          ┌─────────────┐
│ version 1   │ ──────▶  │ 10 GB file  │
│ (pointer)   │          ├─────────────┤
│ version 2   │ ──────▶  │ 10 GB file  │
│ (pointer)   │          │ (full copy) │
└─────────────┘          └─────────────┘
Total: 20 GB stored
Dits
Dits Repository:
┌─────────────────────────────────────┐
│ Manifest: video.mp4 = [A,B,C,D,E]   │
│ Chunks: A,B,C,D,E (10 GB total)     │
│                                     │
│ Version 2: video.mp4 = [A,B,C,F,G]  │
│ Chunks: A,B,C,F,G (only F,G new)    │
└─────────────────────────────────────┘
Total: ~10.2 GB stored

Feature             | Git LFS           | Dits
Storage per version | Full copy         | Changed chunks only
Diff capability     | None              | Chunk-level diff
Merge conflicts     | Manual resolution | Explicit locking

Virtual Filesystem (VFS)

Dits can mount a repository as a virtual drive using FUSE. Files appear instantly but are only "hydrated" (chunks downloaded) when accessed.

# Mount the repository
$ dits mount /mnt/project

# Files appear immediately
$ ls /mnt/project/footage/
scene01.mp4  scene02.mp4  scene03.mp4

# Opening a file triggers on-demand hydration
$ ffplay /mnt/project/footage/scene01.mp4
# Only accessed chunks are fetched

This is ideal for:

  • Previewing large projects without full download
  • NLE (editing software) integration
  • Accessing specific files from a large repository
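
On-demand hydration depends on mapping a read request to the chunks that cover it. The sketch below shows that mapping under simple assumptions: the function name `chunks_for_read` is hypothetical, chunk lengths come from the file's manifest, and the actual FUSE integration and download logic are out of scope.

```rust
/// Return the indices of the chunks that overlap the byte range
/// `[offset, offset + len)` of a file whose chunks have the given lengths.
pub fn chunks_for_read(chunk_lengths: &[u64], offset: u64, len: u64) -> Vec<usize> {
    if len == 0 {
        return Vec::new();
    }
    let end = offset.saturating_add(len);
    let mut needed = Vec::new();
    let mut chunk_start = 0u64;
    for (index, &chunk_len) in chunk_lengths.iter().enumerate() {
        let chunk_end = chunk_start + chunk_len;
        // A chunk is needed when its byte range intersects the requested range.
        if chunk_start < end && chunk_end > offset {
            needed.push(index);
        }
        chunk_start = chunk_end;
    }
    needed
}
```

For a file with chunk lengths 100, 50, 200, and 25 bytes, a 40-byte read at offset 120 touches only the second and third chunks (indices 1 and 2), so only those would need to be fetched.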
