File Compaction

Arc's automatic compaction system merges small Parquet files into larger, optimized files for dramatically faster queries.

Overview

Compaction is Arc's file optimization system that merges small files into larger ones, improving query performance by 10-50x.

Key Features:

  • Automatic - Runs on schedule (default: hourly at :05)
  • Safe - Locked partitions prevent concurrent compaction
  • Efficient - Uses DuckDB for fast, parallel merging
  • Non-blocking - Queries work during compaction
  • Enabled by default - Essential for production
Info: Compaction is enabled by default and runs automatically every hour.

Why Compaction Matters

The Small File Problem

Arc's high-performance ingestion (2.42M records/sec) creates many small files:

At 2.42M records/sec with 5-second flush:
→ ~12M records every 5 seconds
→ 12 files per minute per measurement
→ 720 files per hour per measurement
→ 17,280 files per day per measurement
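
The file counts above follow directly from the flush interval. A quick shell check, assuming one Parquet file per flush per measurement as in the example above:

flush_interval=5                                      # seconds between flushes
echo "files/minute: $(( 60 / flush_interval ))"       # 12
echo "files/hour:   $(( 3600 / flush_interval ))"     # 720
echo "files/day:    $(( 86400 / flush_interval ))"    # 17280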

Impact on Queries:

  • Slow queries - DuckDB must open/scan hundreds of files
  • High costs - More S3/MinIO API calls
  • Poor compression - Small files compress less efficiently
  • Reduced pruning - Less effective partition elimination

After Compaction

Real Production Test Results:

Before: 2,704 small files (Snappy) = 3.7 GB
After: 3 compacted files (ZSTD) = 724 MB

Compression: 80.4% space savings
File reduction: 901x fewer files (2,704 → 3)
Compaction time: 5 seconds

Per-Measurement Breakdown:

  • mem: 888 files → 1 file, 1,213 MB → 239 MB (80.3% compression)
  • disk: 906 files → 1 file, 1,237 MB → 242 MB (80.4% compression)
  • cpu: 910 files → 1 file, 1,246 MB → 243 MB (80.5% compression)

Query Performance:

  • 10-50x faster - Single file scan vs hundreds
  • 99% fewer API calls - Massive cost reduction (2,704 → 3 LIST operations)
  • 80.4% compression - ZSTD compaction vs Snappy writes
  • Effective pruning - DuckDB can skip entire files

How It Works

Compaction Flow

1. Scheduler wakes up (cron: "5 * * * *")

2. Scan storage for eligible partitions

3. For each partition:
- Check age (>1 hour old?)
- Check file count (≥10 files?)
- Check if already compacted?

4. Acquire partition lock (SQLite)

5. Download small files to temp directory

6. Compact using DuckDB (parallel, sorted) - see the sketch after this list

7. Upload compacted file to storage

8. Delete old small files

9. Release lock & cleanup temp files

10. Repeat for next partition
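
Step 6 boils down to a single DuckDB query over the partition's downloaded files. A minimal sketch of that merge step, assuming the small files sit in a local temp directory (the paths below are illustrative, not Arc's actual temp layout):

duckdb <<'SQL'
-- Merge every small file in one hourly partition into a single sorted,
-- ZSTD-compressed Parquet file (paths are illustrative)
COPY (
    SELECT *
    FROM read_parquet('/tmp/arc-compact/default/cpu/2025/10/08/14/*.parquet')
    ORDER BY time
)
TO '/tmp/arc-compact/cpu_2025-10-08_14.parquet'
(FORMAT PARQUET, COMPRESSION ZSTD);
SQL

Sorting by time during the merge keeps row groups ordered, which is what makes file-level pruning effective afterwards.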

Partition Structure

Data is organized by hour:

arc/                                   # Bucket
├── default/                           # Database
│   └── cpu/                           # Measurement
│       └── 2025/10/08/                # Date
│           ├── 14/                    # Hour (2 PM) - Eligible for compaction
│           │   ├── file1.parquet (50 MB)
│           │   ├── file2.parquet (48 MB)
│           │   └── ...
│           ├── 15/                    # Hour (3 PM) - Eligible for compaction
│           └── 16/                    # Hour (4 PM) - CURRENT, skip!

Compaction merges all files in a partition (e.g., 2025/10/08/14/) into one optimized file.
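
To see how many small files a given hourly partition currently holds, you can list it directly in object storage. For example with the MinIO client (the "myminio" alias and bucket path below are illustrative; adjust for your deployment):

# Count the Parquet files in one hourly partition (alias and path are examples)
mc ls myminio/arc/default/cpu/2025/10/08/14/ | wc -l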

Configuration

Default Configuration

Compaction is enabled by default in arc.conf:

[compaction]
enabled = true
min_age_hours = 1 # Wait 1 hour before compacting (let hour complete)
min_files = 10 # Only compact if ≥10 files exist
target_file_size_mb = 512 # Target size for compacted files
schedule = "5 * * * *" # Cron schedule: every hour at :05
max_concurrent_jobs = 2 # Run 2 compactions in parallel
compression = "zstd" # Better compression than snappy
compression_level = 3 # Balance compression vs speed

Configuration Options

Schedule

[compaction]
schedule = "5 * * * *" # Every hour at :05 (default)
# schedule = "0 */2 * * *" # Every 2 hours at :00
# schedule = "0 2 * * *" # Daily at 2 AM

Cron format: minute hour day month weekday

Minimum Age

[compaction]
min_age_hours = 1 # Don't compact current hour (default)
# min_age_hours = 2 # Wait 2 hours (more conservative)
# min_age_hours = 0 # Compact immediately (aggressive)
Caution: Setting min_age_hours = 0 lets Arc compact the current hour while data is still being written, which can trigger repeated compaction passes and leave multiple compacted files in the same partition.

Minimum Files

[compaction]
min_files = 10 # Only compact if ≥10 files (default)
# min_files = 50 # Only compact with many files
# min_files = 5 # Compact more aggressively

Target File Size

[compaction]
target_file_size_mb = 512 # Target 512MB files (default)
# target_file_size_mb = 1024 # Larger files (fewer files, longer compaction)
# target_file_size_mb = 256 # Smaller files (more files, faster compaction)

Concurrent Jobs

[compaction]
max_concurrent_jobs = 2 # Run 2 compactions in parallel (default)
# max_concurrent_jobs = 4 # More parallelism (use more CPU/memory)
# max_concurrent_jobs = 1 # Sequential (lower resource usage)

Compression

[compaction]
compression = "zstd" # Best compression (default)
compression_level = 3 # Balance speed vs compression (default)

# Options:
# compression = "snappy" # Fastest, lower compression
# compression = "gzip" # Good compression, slower
# compression = "zstd" # Best compression, good speed

Disable Compaction

[compaction]
enabled = false

When to disable:

  • Testing ingestion performance
  • Very low write volume (<10 files/hour)
  • Debugging compaction issues
Warning: Disabling compaction will cause queries to slow down significantly as small files accumulate.

Monitoring

Check Compaction Status

curl http://localhost:8000/api/compaction/status \
-H "Authorization: Bearer YOUR_TOKEN"

Response:

{
  "enabled": true,
  "running": false,
  "last_run": "2025-10-08T14:05:00Z",
  "next_run": "2025-10-08T15:05:00Z",
  "stats": {
    "total_jobs": 42,
    "successful_jobs": 40,
    "failed_jobs": 2,
    "total_files_compacted": 12580,
    "total_bytes_saved": 8589934592
  }
}
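
The fields above can be piped through jq for a quick health summary, for example:

# Summarize compaction health from the status response above
curl -s http://localhost:8000/api/compaction/status \
  -H "Authorization: Bearer YOUR_TOKEN" \
  | jq '{running, failed_jobs: .stats.failed_jobs, gb_saved: (.stats.total_bytes_saved / 1e9)}'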

Get Detailed Statistics

curl http://localhost:8000/api/compaction/stats \
-H "Authorization: Bearer YOUR_TOKEN"

List Eligible Partitions

curl http://localhost:8000/api/compaction/candidates \
-H "Authorization: Bearer YOUR_TOKEN"

Response:

{
  "candidates": [
    {
      "partition": "default/cpu/2025/10/08/14",
      "file_count": 150,
      "total_size_mb": 7500,
      "age_hours": 2.5,
      "eligible": true
    },
    {
      "partition": "default/mem/2025/10/08/14",
      "file_count": 120,
      "total_size_mb": 6000,
      "age_hours": 2.5,
      "eligible": true
    }
  ],
  "total_candidates": 2
}
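
A jq filter over this response makes it easy to spot the partitions with the most files, for example:

# List candidate partitions, largest file counts first
curl -s http://localhost:8000/api/compaction/candidates \
  -H "Authorization: Bearer YOUR_TOKEN" \
  | jq -r '.candidates | sort_by(-.file_count) | .[] | "\(.partition)\t\(.file_count) files\t\(.total_size_mb) MB"'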

Manually Trigger Compaction

curl -X POST http://localhost:8000/api/compaction/trigger \
-H "Authorization: Bearer YOUR_TOKEN"

View Active Jobs

curl http://localhost:8000/api/compaction/jobs \
-H "Authorization: Bearer YOUR_TOKEN"

View Job History

curl http://localhost:8000/api/compaction/history \
-H "Authorization: Bearer YOUR_TOKEN"

Performance Impact

Compaction Performance

Test Environment: Apple M3 Max (14 cores, 36GB RAM)

Files   Size     Compaction Time   Final Size   Compression
888     1.2 GB   2.1s              239 MB       80.3%
906     1.2 GB   2.2s              242 MB       80.4%
910     1.2 GB   2.3s              243 MB       80.5%

Total: 2,704 files (3.7 GB) → 3 files (724 MB) in 6.6 seconds

Query Performance

Before Compaction:

SELECT * FROM cpu WHERE time > NOW() - INTERVAL 1 HOUR;
-- 5.2 seconds (scan 720 files)

After Compaction:

SELECT * FROM cpu WHERE time > NOW() - INTERVAL 1 HOUR;
-- 0.05 seconds (scan 1 file) - 104x faster!

Storage Savings

Original files (Snappy):  3.7 GB
Compacted files (ZSTD): 724 MB
Space saved: 80.4%

Best Practices

1. Let Compaction Run Automatically

The default schedule (hourly) works well for most use cases:

[compaction]
enabled = true
schedule = "5 * * * *"

2. Monitor Compaction Jobs

Set up alerts for the following (a sample check script follows the list):

  • Failed compaction jobs
  • Partitions with >1000 files
  • Compaction taking >10 minutes
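
A minimal sketch of such a check, suitable for cron, based on the /status and /candidates responses shown earlier (the token and thresholds are placeholders):

#!/usr/bin/env bash
# Exit non-zero if any compaction jobs failed or a partition exceeds 1000 files
set -euo pipefail
TOKEN="YOUR_TOKEN"
BASE="http://localhost:8000/api/compaction"

failed=$(curl -s "$BASE/status" -H "Authorization: Bearer $TOKEN" | jq '.stats.failed_jobs')
big=$(curl -s "$BASE/candidates" -H "Authorization: Bearer $TOKEN" \
      | jq '[.candidates[] | select(.file_count > 1000)] | length')

if [ "$failed" -gt 0 ] || [ "$big" -gt 0 ]; then
  echo "ALERT: $failed failed compaction jobs, $big partitions with >1000 files"
  exit 1
fi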

3. Adjust Based on Write Volume

High volume (>10M records/sec):

[compaction]
min_files = 100 # Wait for more files
max_concurrent_jobs = 4 # More parallelism

Low volume (<100K records/sec):

[compaction]
min_files = 5 # Compact with fewer files
schedule = "0 */6 * * *" # Every 6 hours

4. Use Appropriate Target File Size

[compaction]
target_file_size_mb = 512 # Good default
# target_file_size_mb = 1024 # For very large datasets
# target_file_size_mb = 256 # For faster compaction

5. Reduce File Generation at Source

Best practice: Increase buffer sizes to generate fewer files:

[ingestion]
buffer_size = 200000 # Up from 50,000 (4x fewer files)
buffer_age_seconds = 10 # Up from 5 (2x fewer files)

Impact:

  • Files generated: 2,000/hour → 250/hour (8x reduction)
  • Compaction time: 150s → 20s (7x faster)
  • Memory usage: +300MB per worker

This is the most effective optimization - fewer files means faster compaction AND faster queries.

Troubleshooting

Compaction Not Running

Check status:

curl http://localhost:8000/api/compaction/status

Verify configuration:

# Check if enabled
grep "enabled" arc.conf

# Check schedule
grep "schedule" arc.conf

Check logs:

# Docker
docker-compose logs arc-api | grep compaction

# Native
tail -f logs/arc-api.log | grep compaction

Compaction Taking Too Long

Symptoms: Compaction jobs running for >30 minutes

Solutions:

  1. Reduce target file size:

    [compaction]
    target_file_size_mb = 256 # Smaller chunks
  2. Increase parallelism:

    [compaction]
    max_concurrent_jobs = 4
  3. Reduce files at source:

    [ingestion]
    buffer_size = 200000

Out of Disk Space During Compaction

Symptoms: Compaction fails with disk space errors

Solutions:

  1. Use temp directory on larger disk:

    export TMPDIR=/mnt/large-disk/tmp
  2. Reduce concurrent jobs:

    [compaction]
    max_concurrent_jobs = 1
  3. Clean up old compacted files manually:

    # Remove small files that were already compacted
    find ./data -name "*.parquet" -size -10M -delete

Compaction Locks Not Releasing

Symptoms: Partitions stuck in "locked" state

Check locks:

# View active locks
sqlite3 ./data/arc.db "SELECT * FROM compaction_locks;"

Clear stale locks:

# Locks expire automatically after 2 hours
# Or manually clear:
sqlite3 ./data/arc.db "DELETE FROM compaction_locks WHERE expires_at < datetime('now');"

API Reference

GET /api/compaction/status

Get current compaction status.

Response:

{
  "enabled": true,
  "running": false,
  "last_run": "2025-10-08T14:05:00Z",
  "next_run": "2025-10-08T15:05:00Z"
}

GET /api/compaction/stats

Get detailed compaction statistics.

GET /api/compaction/candidates

List partitions eligible for compaction.

POST /api/compaction/trigger

Manually trigger compaction.

Response:

{
  "message": "Compaction triggered",
  "job_id": "comp_1696775400"
}

GET /api/compaction/jobs

View active compaction jobs.

GET /api/compaction/history

View compaction job history.

Summary

Compaction is essential for production deployments:

Benefits:

  • 10-50x faster queries
  • 80% storage savings
  • 99% fewer API calls
  • Automatic and safe

Default configuration works for most cases:

[compaction]
enabled = true
schedule = "5 * * * *"
min_age_hours = 1
min_files = 10

Monitor regularly:

  • Check /api/compaction/status
  • Alert on failed jobs
  • Watch for partitions with >1000 files

Next Steps