Core Concepts

Understanding how Dits works will help you use it effectively. This page explains the key concepts behind Dits.

Content-Defined Chunking

Unlike Git, which stores each file as a single object, Dits splits files into variable-size chunks based on their content. This is called content-defined chunking (CDC).

Traditional Approach

File: video.mp4 (2GB)
├── Stored as single blob
└── Any change = re-store 2GB

Dits Approach

File: video.mp4 (2GB)
├── Chunk 1: 1.2 MB (hash: abc...)
├── Chunk 2: 0.9 MB (hash: def...)
├── ...
└── Only changed chunks stored

The chunking algorithm (FastCDC) uses a rolling hash to find chunk boundaries based on content, not fixed positions. This means:

Insertions/deletions don't cascade: If you insert data in the middle of a file, only the chunks near the insertion point change
Deduplication works across files: If two files share content (like different cuts of the same footage), they share chunks
Efficient syncing: Only new/changed chunks need to be transferred

Chunking Algorithms

Dits implements multiple content-defined chunking algorithms, each optimized for different use cases:

FastCDC (Default)

FastCDC (Fast Content-Defined Chunking) is Dits' primary algorithm, providing excellent performance and deduplication ratios.

Additional Chunking Algorithms

Beyond FastCDC, Dits implements several specialized chunking algorithms for different performance and security requirements:

Rabin Fingerprinting

Classic polynomial rolling hash algorithm
Strong locality guarantees (identical content = identical boundaries)
May produce more variable chunk sizes than FastCDC
Best for: Applications requiring strict content-aware boundaries

Asymmetric Extremum (AE)

Places boundaries at local minima/maxima in sliding windows
Better control over chunk size distribution
Reduces extreme chunk size variance
Best for: Consistent chunk sizes, lower metadata overhead

Chonkers Algorithm

Advanced layered algorithm with mathematical guarantees
Provable strict bounds on both chunk size AND edit locality
Uses hierarchical merging (balancing → caterpillar → diffbit phases)
Best for: Mission-critical applications requiring guarantees

Parallel FastCDC

Multi-core implementation of FastCDC
Splits large files into segments processed in parallel
2-4x throughput improvement on multi-core systems
Best for: Large files, high-throughput environments

Keyed FastCDC (KCDC)

Security-enhanced FastCDC with secret key
Prevents fingerprinting attacks via chunk length patterns
Same performance as FastCDC with added privacy protection
Best for: Encrypted backups, privacy-sensitive applications

How Chunk Boundaries Are Determined

Fixed-size chunking problem

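Insert X → All chunks shift!

Before: AAAA | BBBB | CCCC | DDDD
After:  XAAA | ABBB | BCCC | CDDD | D

ALL chunks changed! 0% reuse

CDC solution

Insert X → Only first chunk changes

Before: AAAA  | BBBB | CCCC | DDDD
After:  XAAAA | BBBB | CCCC | DDDD

1 new chunk! 3 of 4 chunks reused here; with five or more chunks, reuse is 80%+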
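The comparison above can be reproduced with a few lines of Rust. This is a toy sketch, not Dits' chunker: the content-defined rule simply ends a chunk after every `|` byte, standing in for the rolling-hash test a real CDC algorithm performs. The function names are hypothetical.

```rust
use std::collections::HashSet;

/// Split into fixed-size chunks (the last chunk may be shorter).
fn fixed_chunks(data: &[u8], size: usize) -> Vec<Vec<u8>> {
    data.chunks(size).map(|c| c.to_vec()).collect()
}

/// Toy content-defined rule: end a chunk right after every `|` byte.
/// ASSUMPTION: a real chunker tests a rolling hash, not a literal byte.
fn content_chunks(data: &[u8]) -> Vec<Vec<u8>> {
    data.split_inclusive(|&b| b == b'|')
        .map(|c| c.to_vec())
        .collect()
}

/// Count how many chunks of `new` already exist in `old`.
fn reused(old: &[Vec<u8>], new: &[Vec<u8>]) -> usize {
    let known: HashSet<&[u8]> = old.iter().map(|c| c.as_slice()).collect();
    new.iter().filter(|c| known.contains(c.as_slice())).count()
}
```

With `AAAA|BBBB|CCCC|DDDD|` as the original and an `X` inserted at the front, fixed 5-byte chunks share nothing with the original, while the content-defined chunks still share the last three.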

Algorithm Parameters

FastCDC is controlled by a minimum, a target average, and a maximum chunk size, plus a normalization level:

// FastCDC configuration for video files
min_size: 32KB     // Minimum chunk size
avg_size: 64KB     // Target average size
max_size: 256KB    // Maximum chunk size
normalization: 2   // Size distribution control

Rolling Hash Implementation

FastCDC uses a "gear hash": a rolling hash driven by a precomputed table of 256 random 64-bit values, one per possible byte value:

// Rolling hash state
hash = 0

// For each byte in the file:
hash = (hash << 1) + gear_table[byte_value]

// Check if hash matches boundary pattern:
// (hash & mask) == 0 → create chunk boundary
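To make the pseudocode above concrete, here is a minimal gear-hash chunker in Rust. It is a sketch under several assumptions: the gear table is generated from a fixed seed with SplitMix64 (the table Dits actually ships is not shown on this page), a single mask derived from `avg_size` is used instead of FastCDC's normalized two-mask scheme, and `Params`, `gear_table`, and `chunk_lengths` are hypothetical names.

```rust
/// Chunking parameters (field names mirror the configuration above).
#[derive(Clone, Copy)]
struct Params {
    min_size: usize,
    avg_size: usize, // ASSUMPTION: a power of two, so `avg_size - 1` is a bit mask
    max_size: usize,
}

/// Build a 256-entry gear table from a fixed seed using SplitMix64.
/// ASSUMPTION: any fixed table of well-mixed 64-bit values works for a demo.
fn gear_table() -> [u64; 256] {
    let mut table = [0u64; 256];
    let mut state: u64 = 0x9E37_79B9_7F4A_7C15;
    for entry in table.iter_mut() {
        state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        *entry = z ^ (z >> 31);
    }
    table
}

/// Return the length of each chunk, in order.
fn chunk_lengths(data: &[u8], p: Params, gear: &[u64; 256]) -> Vec<usize> {
    let mask = (p.avg_size as u64) - 1;
    let mut lengths = Vec::new();
    let mut start = 0;
    while start < data.len() {
        let remaining = &data[start..];
        // Never scan past max_size: that is the forced cut point.
        let limit = remaining.len().min(p.max_size);
        let mut hash: u64 = 0;
        let mut cut = limit;
        for (i, &byte) in remaining[..limit].iter().enumerate() {
            // Gear hash update: shift, then add the table entry for this byte.
            hash = (hash << 1).wrapping_add(gear[byte as usize]);
            // Boundaries are only accepted once min_size bytes are consumed.
            if i + 1 >= p.min_size && (hash & mask) == 0 {
                cut = i + 1;
                break;
            }
        }
        lengths.push(cut);
        start += cut;
    }
    lengths
}
```

Because the boundary test depends only on the bytes just read, the same content produces the same cut points wherever it appears, subject to the `min_size` and `max_size` limits.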

Performance Characteristics

Implementation	Throughput	Platform
Scalar (baseline)	800 MB/s	All CPUs
SSE4.1	1.2 GB/s	Intel/AMD
AVX2	2.0 GB/s	Modern Intel/AMD
AVX-512	3.5 GB/s	High-end Intel
ARM NEON	1.5-2.5 GB/s	Apple Silicon, ARM64

Content Addressing

Every piece of data in Dits is identified by its content hash (a BLAKE3 hash by default). This is called content addressing.

# Every chunk has a unique hash based on its content
Chunk abc123... = specific 1.2MB of video data
Chunk def456... = specific 0.9MB of video data

# Files are just lists of chunk hashes
video.mp4 = [abc123, def456, ghi789, ...]

# Commits reference file manifests by hash
Commit xyz... -> Manifest hash -> File hashes -> Chunk hashes

Benefits of content addressing:

Automatic deduplication: Identical content always has the same hash, so it's only stored once
Data integrity: If a chunk's hash doesn't match, you know it's corrupted
Immutability: You can't modify stored data without changing its address

Cryptographic Hashing

Dits supports multiple cryptographic hash algorithms for different performance and security trade-offs:

BLAKE3 (Default)

Dits uses BLAKE3 as the default hash algorithm for its exceptional performance and security:

Property	SHA-256	BLAKE3
Speed	~500 MB/s	3-6 GB/s (multi-threaded)
Parallelism	Single-threaded	Multi-threaded
Security	Proven	Proven (BLAKE family)

Alternative Hash Algorithms

SHA-256

Industry standard cryptographic hash
Widely trusted and analyzed
Roughly 6-12x slower than multi-threaded BLAKE3 (per the speed table above)
Best for: Regulatory compliance, maximum compatibility

SHA-3-256

Sponge construction (Keccak), structurally unrelated to SHA-2
Different algorithm family than SHA-2
Markedly slower than BLAKE3 in software
Best for: Algorithm diversity, policies that mandate SHA-3

Hash Algorithm Selection

# Configure repository to use different hash algorithm
dits config core.hashAlgorithm sha256

# Available options: blake3, sha256, sha3-256
# Default: blake3 (recommended for performance)

All hash algorithms produce 256-bit (32-byte) outputs and provide cryptographic security guarantees.
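One way to model the three configuration values in code is a small enum that parses the strings accepted by `core.hashAlgorithm`. This is a hypothetical sketch: `HashAlgorithm` and its methods are illustrative names, not Dits' actual types.

```rust
use std::str::FromStr;

/// Hypothetical model of the `core.hashAlgorithm` setting.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum HashAlgorithm {
    Blake3,
    Sha256,
    Sha3_256,
}

impl HashAlgorithm {
    /// Every supported algorithm produces a 256-bit (32-byte) digest.
    const fn output_len(self) -> usize {
        32
    }
}

impl Default for HashAlgorithm {
    /// BLAKE3 is the documented default.
    fn default() -> Self {
        HashAlgorithm::Blake3
    }
}

impl FromStr for HashAlgorithm {
    type Err = String;

    /// Parse the option strings listed above; anything else is rejected.
    fn from_str(value: &str) -> Result<Self, Self::Err> {
        match value {
            "blake3" => Ok(HashAlgorithm::Blake3),
            "sha256" => Ok(HashAlgorithm::Sha256),
            "sha3-256" => Ok(HashAlgorithm::Sha3_256),
            other => Err(format!("unknown hash algorithm: {}", other)),
        }
    }
}
```

Because all variants share one digest length, code that stores hashes as `[u8; 32]` does not need to change when the setting does.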

Cryptographic Properties

Collision resistance: Computationally infeasible to find two different inputs with the same hash
Preimage resistance: Given a hash, computationally infeasible to find an input that produces it
Second preimage resistance: Given input A, computationally infeasible to find a different input B with the same hash

Hybrid Storage Architecture

Dits uses a hybrid storage system that intelligently chooses the optimal storage method for different types of files. This combines the best of Git's text handling with Dits' binary optimizations.

Text Files (Git Storage)

Source code: .rs, .js, .py, .cpp, etc.
Config files: .json, .yaml, .toml, .xml
Documentation: .md, .txt, .rst
Benefits: Line-based diffs, 3-way merge, blame

Binary Assets (Dits Storage)

Video: .mp4, .mov, .avi, .mkv
3D Models: .obj, .fbx, .gltf, .usd
Game Assets: Unity, Unreal, Godot files
Images: .psd, .raw, large .png/.jpg
Benefits: FastCDC chunking, deduplication

The system automatically classifies files based on extension, content analysis, and filename patterns. This ensures optimal performance and features for each file type while maintaining Git compatibility.
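As a rough illustration of the extension-based part of that classification, the sketch below maps the extensions listed above to a storage backend. It is an assumption-laden simplification: `Storage` and `classify_by_extension` are hypothetical names, content analysis and filename patterns are not modelled, and anything that depends on file size (such as "large .png/.jpg") is left undecided.

```rust
use std::path::Path;

/// Hypothetical routing decision for a file.
#[derive(Debug, PartialEq, Eq)]
enum Storage {
    Git,
    Dits,
    /// Extension alone is not enough; further analysis would be needed.
    Undecided,
}

/// Classify a path using only its extension (case-insensitive).
fn classify_by_extension(path: &str) -> Storage {
    let ext = Path::new(path)
        .extension()
        .and_then(|e| e.to_str())
        .map(|e| e.to_ascii_lowercase());

    match ext.as_deref() {
        // Text files from the list above.
        Some(
            "rs" | "js" | "py" | "cpp" | "json" | "yaml" | "toml" | "xml" | "md" | "txt"
            | "rst",
        ) => Storage::Git,
        // Binary assets from the list above.
        Some(
            "mp4" | "mov" | "avi" | "mkv" | "obj" | "fbx" | "gltf" | "usd" | "psd" | "raw",
        ) => Storage::Dits,
        _ => Storage::Undecided,
    }
}
```

Returning `Undecided` rather than guessing keeps the sketch honest about what an extension check alone can determine.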

Best of Both Worlds

Use Git's powerful text operations for code while benefiting from Dits' binary optimizations for creative assets. All files coexist in the same repository with unified version control.

Working Alongside Git

Dits is designed to work alongside Git in the same project directory. Initialize both repositories separately (git init then dits init) to get hybrid storage that automatically uses the best system for each file type.

Manifest System

The manifest is Dits' authoritative record of a commit's file tree. It describes how to reconstruct files from chunks and stores rich metadata.

What a Manifest Contains

Each manifest includes:

All files in the repository at that commit
File metadata (size, permissions, timestamps)
Chunk references for reconstructing content
Asset metadata (video dimensions, codec, duration)
Directory structure for efficient browsing
Dependency graphs for project files

Manifest Data Structure

pub struct ManifestPayload {
    pub version: u8,                    // Format version
    pub repo_id: Uuid,                  // Repository identifier
    pub commit_hash: [u8; 32],          // This commit's hash
    pub parent_hash: Option<[u8; 32]>, // Parent commit (for diffs)

    pub entries: Vec<ManifestEntry>,    // All files
    pub directories: Vec<DirectoryEntry>, // Directory structure
    pub dependencies: Option<DependencyGraph>, // File relationships
    pub stats: ManifestStats,           // Aggregate statistics
}

File Representation

Each file is represented as a manifest entry:

pub struct ManifestEntry {
    pub path: String,                  // Relative path
    pub size: u64,                     // File size in bytes
    pub content_hash: [u8; 32],        // Full file BLAKE3 hash
    pub chunks: Vec<ChunkRef>,         // How to reconstruct file

    // Rich metadata
    pub metadata: FileMetadata,        // MIME type, encoding, etc.
    pub asset_metadata: Option<AssetMetadata>, // Video/audio specifics
}
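Reconstructing a file from a manifest entry amounts to concatenating its chunks in order. The sketch below shows that idea with simplified, hypothetical types: the real `ChunkRef` layout is not shown on this page, so the `hash` and `length` fields here are assumptions, and a plain `HashMap` stands in for the chunk store.

```rust
use std::collections::HashMap;

type Hash = [u8; 32];

/// Hypothetical, simplified chunk reference (the real fields are not documented here).
struct ChunkRef {
    hash: Hash,
    length: u64,
}

/// Rebuild file content by concatenating chunks in manifest order.
fn reconstruct(
    chunks: &[ChunkRef],
    store: &HashMap<Hash, Vec<u8>>,
    expected_size: u64,
) -> Result<Vec<u8>, String> {
    let mut out = Vec::new();
    for (index, chunk) in chunks.iter().enumerate() {
        let bytes = store
            .get(&chunk.hash)
            .ok_or_else(|| format!("chunk {} is missing from the store", index))?;
        if bytes.len() as u64 != chunk.length {
            return Err(format!("chunk {} has an unexpected length", index));
        }
        out.extend_from_slice(bytes);
    }
    // The manifest entry records the full file size, which gives a cheap final check.
    if out.len() as u64 != expected_size {
        return Err("reconstructed size does not match the manifest".to_string());
    }
    Ok(out)
}
```

A fuller implementation would also hash the result and compare it with `content_hash`; that step is omitted because it needs a BLAKE3 implementation.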

Asset Metadata Extraction

For media files, Dits extracts comprehensive metadata:

pub struct AssetMetadata {
    pub asset_type: AssetType,        // Video, Audio, Image
    pub duration_ms: Option<u64>,     // Playback duration
    pub width: Option<u32>,           // Video width
    pub height: Option<u32>,          // Video height
    pub video_codec: Option<String>,  // "h264", "prores", etc.
    pub audio_codec: Option<String>,  // "aac", "pcm", etc.

    // Camera metadata
    pub camera_metadata: Option<CameraMetadata>,
    pub thumbnail: Option<[u8; 32]>,  // Thumbnail chunk hash
}

Repository Structure

A Dits repository is stored in a .dits directory with this structure:

.dits/
├── HEAD              # Current branch reference
├── config            # Repository configuration
├── index             # Staging area
├── objects/          # Content-addressed storage
│   ├── chunks/       # File chunks
│   ├── manifests/    # File manifests
│   └── commits/      # Commit objects
└── refs/
    ├── heads/        # Branch refs
    └── tags/         # Tag refs

Object Types

Chunk

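The fundamental unit of storage. A variable-size piece of file content, typically 256KB to 4MB, although the exact range depends on the chunking parameters in effect (the video configuration under Algorithm Parameters, for example, uses 32KB to 256KB).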

Manifest

Describes how to reconstruct a file from chunks. Contains the ordered list of chunk hashes, file metadata (size, permissions), and optional video metadata.

Commit

A snapshot of the repository at a point in time. Contains:

Tree (manifest) hash pointing to file state
Parent commit hash(es)
Author and committer information
Commit message
Timestamp

Branch

A mutable reference to a commit. The default branch is main. Branches make it easy to work on different versions simultaneously.

Tag

An immutable reference to a commit, typically used to mark releases or important versions.

Sync Protocol and Delta Efficiency

Dits uses a sophisticated sync protocol to minimize bandwidth usage.

Have/Want Protocol

Instead of sending entire files, Dits negotiates what data is needed:

Local chunks:  A B C D E F
Remote chunks: A B

Query what remote has
  "Do you have A, B, C, D, E, F?"

Remote responds with Bloom filter
  Have: A, B    Missing: C, D, E, F

Upload only missing chunks
  → Transfer C, D, E, F (not A, B!)
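At its core, the negotiation above is a set difference. The sketch below uses an exact `HashSet` for the remote's answer instead of a Bloom filter, which is a simplification: a Bloom filter can report false positives, so a real client would need to handle chunks the remote only appears to have. `missing_chunks` is a hypothetical name.

```rust
use std::collections::HashSet;

/// Return the local chunk IDs the remote does not have, in local order.
/// ASSUMPTION: `remote_has` is an exact set, not a probabilistic filter.
fn missing_chunks<'a>(local: &[&'a str], remote_has: &HashSet<&str>) -> Vec<&'a str> {
    local
        .iter()
        .copied()
        .filter(|id| !remote_has.contains(*id))
        .collect()
}
```

For local chunks A through F and a remote that holds A and B, the function returns C, D, E, and F, which are the only chunks that need to be uploaded.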

Delta Sync Efficiency

Traditional sync

File changed → upload entire file

10 GB video, small edit → transfer 10 GB

Dits delta sync

File changed → identify changed chunks

10 GB video, small edit → transfer ~50 MB

Performance Characteristics

Download Performance Optimizations

Dits implements multiple optimizations to maximize download speeds and utilize full network capacity:

Streaming FastCDC

Problem: Memory-bound chunking
Solution: 64KB sliding window
Result: Process any file size
Memory: 99.9% reduction vs buffered

Parallel Processing

Multi-core chunking: 3-4x speedup
Parallel downloads: Aggregate bandwidth
Concurrent transfers: 1000+ streams
Zero-copy I/O: 50-70% less CPU

High-Throughput QUIC Transport

Concurrent streams: 1000+ parallel transfers
Large flow windows: 16MB buffers for high bandwidth
Connection pooling: Reuse connections, eliminate handshakes
BBR congestion control: Optimized for modern networks

Adaptive Chunk Sizing

Network Type	Optimal Chunk Size	Strategy
LAN (>1Gbps)	8MB	Maximum throughput
Fast broadband (100Mbps)	2MB	Balanced performance
High latency (satellite)	256KB	Responsiveness priority
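The table above can be expressed as a simple lookup. This is only a sketch: `NetworkType` and `optimal_chunk_size` are hypothetical names, the sizes are interpreted as binary units (1 MB = 1024 × 1024 bytes), and how Dits detects the network type is not covered here.

```rust
/// Hypothetical network categories matching the rows of the table above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum NetworkType {
    Lan,
    FastBroadband,
    HighLatency,
}

const KIB: usize = 1024;
const MIB: usize = 1024 * KIB;

/// Chunk size in bytes for each network category.
/// ASSUMPTION: KB and MB in the table are binary units.
fn optimal_chunk_size(network: NetworkType) -> usize {
    match network {
        NetworkType::Lan => 8 * MIB,
        NetworkType::FastBroadband => 2 * MIB,
        NetworkType::HighLatency => 256 * KIB,
    }
}
```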

Maximum Speed Downloads

Downloads are designed to saturate the available bandwidth without software-imposed throttling, and aggregate throughput scales roughly linearly with the number of available peers.

Throughput Benchmarks

Operation	Performance	Notes
Streaming Chunking	Any file size	Memory use independent of file size
Parallel Chunking	8+ GB/s	Multi-core processing
QUIC Transfer	1+ GB/s	1000+ concurrent streams
Multi-peer Download	N × peer bandwidth	Linear scaling with peers
Hashing (BLAKE3)	6 GB/s	Multi-threaded
File reconstruction	500 MB/s	Sequential reads

Video-Aware Features

Why Video-Aware Matters

Video files have internal structure (containers, tracks, keyframes). Dits understands this structure to optimize chunking and reconstruction.

For MP4/MOV files, Dits:

Preserves container structure: The moov atom (metadata) is kept intact
Aligns to keyframes: Chunk boundaries prefer I-frames for better deduplication of related footage
Extracts metadata: Duration, resolution, and codec info are stored for quick access

Deduplication in Action

Consider this scenario:

# You have two versions of the same footage
scene01_take1.mp4  (10 GB, 10,000 chunks)
scene01_take2.mp4  (10 GB, 10,000 chunks)

# But 95% of the content is identical
# Dits stores:
- 10,000 unique chunks from take1
- 500 unique chunks from take2
- Total: 10,500 chunks (~10.5 GB) instead of 20 GB

# Deduplication savings: 47.5%

The more similar content you have, the greater the savings. This is especially powerful for:

Multiple takes of the same scene
Different cuts/edits of the same footage
Footage from the same camera/location
Projects that share B-roll or stock footage

Real-World Deduplication Scenarios

Scenario	Raw Size	Deduplicated	Savings
5 versions of video (minor edits)	50 GB	12 GB	76%
100 similar photos (same shoot)	50 GB	8 GB	84%
10 game builds (iterative)	100 GB	18 GB	82%
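The savings figures in the take1/take2 example and in the table above all come from the same formula: savings = (raw − stored) / raw. A small helper (the name `savings_percent` is hypothetical) makes the arithmetic easy to check:

```rust
/// Percentage of storage saved by deduplication.
/// Both arguments must use the same unit (GB, chunks, bytes, ...).
fn savings_percent(raw: f64, stored: f64) -> f64 {
    100.0 * (raw - stored) / raw
}
```

For the two takes, 20 GB raw and 10.5 GB stored gives 47.5%. The table rows work out the same way: 50 GB to 12 GB is 76%, 50 GB to 8 GB is 84%, and 100 GB to 18 GB is 82%.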

Security & Integrity

Content Addressing Security

Every piece of data is identified by its cryptographic hash:

Content → BLAKE3 hash → Storage

If content changes by even 1 bit:
  → Completely different hash
  → Stored as new content
  → Tampering is detectable

Verification Commands

$ dits fsck
Verifying repository integrity...
Checking objects... ✓
Checking references... ✓
Checking manifests... ✓
Verifying 45,678 chunks...
  [████████████████████████████████] 100%
All chunks verified ✓
Repository is healthy.

Encryption Options

In transit: All network transfers are encrypted with TLS 1.3, either over TCP or as part of QUIC
At rest (optional): Files encrypted before storage

Comparison with Alternatives

Git LFS

Git Repository:          LFS Server:
┌─────────────┐          ┌─────────────┐
│ version 1   │ ──────▶  │ 10 GB file  │
│ (pointer)   │          ├─────────────┤
│ version 2   │ ──────▶  │ 10 GB file  │
│ (pointer)   │          │ (full copy) │
└─────────────┘          └─────────────┘
Total: 20 GB stored

Dits

Dits Repository:
┌─────────────────────────────────────┐
│ Manifest: video.mp4 = [A,B,C,D,E]   │
│ Chunks: A,B,C,D,E (10 GB total)     │
│                                     │
│ Version 2: video.mp4 = [A,B,C,F,G]  │
│ Chunks: A,B,C,F,G (only F,G new)    │
└─────────────────────────────────────┘
Total: ~10.2 GB stored

Feature	Git LFS	Dits
Storage per version	Full copy	Changed chunks only
Diff capability	None	Chunk-level diff
Merge conflicts	Manual resolution	Explicit locking

Virtual Filesystem (VFS)

Dits can mount a repository as a virtual drive using FUSE. Files appear instantly but are only "hydrated" (chunks downloaded) when accessed.

# Mount the repository
$ dits mount /mnt/project

# Files appear immediately
$ ls /mnt/project/footage/
scene01.mp4  scene02.mp4  scene03.mp4

# Opening a file triggers on-demand hydration
$ ffplay /mnt/project/footage/scene01.mp4
# Only accessed chunks are fetched

This is ideal for:

Previewing large projects without full download
NLE (editing software) integration
Accessing specific files from a large repository
