File Compaction

Arc's automatic compaction system merges small Parquet files into larger, optimized files for dramatically faster queries.

Overview

Compaction is Arc's file optimization system that merges small files into larger ones, improving query performance by 10-50x.

Key Features:

  • Automatic - Runs on schedule (default: hourly at :05)
  • Safe - Locked partitions prevent concurrent compaction
  • Efficient - Uses DuckDB for fast, parallel merging
  • Non-blocking - Queries work during compaction
  • Enabled by default - Essential for production
Info: Compaction is enabled by default and runs automatically every hour.

Why Compaction Matters

The Small File Problem

Arc's high-performance ingestion (2.42M records/sec) creates many small files:

At 2.42M records/sec with 5-second flush:
→ ~12M records every 5 seconds
→ 12 files per minute per measurement
→ 720 files per hour per measurement
→ 17,280 files per day per measurement
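
The file counts above follow directly from the flush interval. A quick shell check, assuming one Parquet file per flush per measurement as in the example above:

flush_interval=5                                      # seconds between flushes
echo "files/minute: $(( 60 / flush_interval ))"       # 12
echo "files/hour:   $(( 3600 / flush_interval ))"     # 720
echo "files/day:    $(( 86400 / flush_interval ))"    # 17280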

Impact on Queries:

  • Slow queries - DuckDB must open/scan hundreds of files
  • High costs - More S3/MinIO API calls
  • Poor compression - Small files compress less efficiently
  • Reduced pruning - Less effective partition elimination

After Compaction

Real Production Test Results:

Before: 2,704 small files (Snappy) = 3.7 GB
After: 3 compacted files (ZSTD) = 724 MB

Compression: 80.4% space savings
File reduction: 901x fewer files (2,704 → 3)
Compaction time: 5 seconds

Per-Measurement Breakdown:

  • mem: 888 files → 1 file, 1,213 MB → 239 MB (80.3% compression)
  • disk: 906 files → 1 file, 1,237 MB → 242 MB (80.4% compression)
  • cpu: 910 files → 1 file, 1,246 MB → 243 MB (80.5% compression)

Query Performance:

  • 10-50x faster - Single file scan vs hundreds
  • 99% fewer API calls - Massive cost reduction (2,704 → 3 LIST operations)
  • 80.4% compression - ZSTD compaction vs Snappy writes
  • Effective pruning - DuckDB can skip entire files

How It Works

Compaction Flow

1. Scheduler wakes up (cron: "5 * * * *")

2. Scan storage for eligible partitions

3. For each partition:
- Check age (>1 hour old?)
- Check file count (≥10 files?)
- Check if already compacted?

4. Acquire partition lock (SQLite)

5. Download small files to temp directory

6. Compact using DuckDB (parallel, sorted) - see the sketch after this list

7. Upload compacted file to storage

8. Delete old small files

9. Release lock & cleanup temp files

10. Repeat for next partition
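
Step 6 boils down to a single DuckDB query over the partition's downloaded files. A minimal sketch of that merge step, assuming the small files sit in a local temp directory (the paths below are illustrative, not Arc's actual temp layout):

duckdb <<'SQL'
-- Merge every small file in one hourly partition into a single sorted,
-- ZSTD-compressed Parquet file (paths are illustrative)
COPY (
    SELECT *
    FROM read_parquet('/tmp/arc-compact/default/cpu/2025/10/08/14/*.parquet')
    ORDER BY time
)
TO '/tmp/arc-compact/cpu_2025-10-08_14.parquet'
(FORMAT PARQUET, COMPRESSION ZSTD);
SQL

Sorting by time during the merge keeps row groups ordered, which is what makes file-level pruning effective afterwards.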

Partition Structure

Data is organized by hour:

arc/                                   # Bucket
├── default/                           # Database
│   └── cpu/                           # Measurement
│       └── 2025/10/08/                # Date
│           ├── 14/                    # Hour (2 PM) - Eligible for compaction
│           │   ├── file1.parquet (50 MB)
│           │   ├── file2.parquet (48 MB)
│           │   └── ...
│           ├── 15/                    # Hour (3 PM) - Eligible for compaction
│           └── 16/                    # Hour (4 PM) - CURRENT, skip!

Compaction merges all files in a partition (e.g., 2025/10/08/14/) into one optimized file.
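
To see how many small files a given hourly partition currently holds, you can list it directly in object storage. For example with the MinIO client (the "myminio" alias and bucket path below are illustrative; adjust for your deployment):

# Count the Parquet files in one hourly partition (alias and path are examples)
mc ls myminio/arc/default/cpu/2025/10/08/14/ | wc -l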

Configuration

Default Configuration

Compaction is enabled by default in arc.conf:

[compaction]
enabled = true
min_age_hours = 1 # Wait 1 hour before compacting (let hour complete)
min_files = 10 # Only compact if ≥10 files exist
target_file_size_mb = 512 # Target size for compacted files
schedule = "5 * * * *" # Cron schedule: every hour at :05
max_concurrent_jobs = 2 # Run 2 compactions in parallel
compression = "zstd" # Better compression than snappy
compression_level = 3 # Balance compression vs speed

Configuration Options

Schedule

[compaction]
schedule = "5 * * * *" # Every hour at :05 (default)
# schedule = "0 */2 * * *" # Every 2 hours at :00
# schedule = "0 2 * * *" # Daily at 2 AM

Cron format: minute hour day month weekday

Minimum Age

[compaction]
min_age_hours = 1 # Don't compact current hour (default)
# min_age_hours = 2 # Wait 2 hours (more conservative)
# min_age_hours = 0 # Compact immediately (aggressive)
Caution: Setting min_age_hours = 0 lets Arc compact the current hour while data is still being written, which can trigger repeated compaction passes and leave multiple compacted files in the same partition.

Minimum Files

[compaction]
min_files = 10 # Only compact if ≥10 files (default)
# min_files = 50 # Only compact with many files
# min_files = 5 # Compact more aggressively

Target File Size

[compaction]
target_file_size_mb = 512 # Target 512MB files (default)
# target_file_size_mb = 1024 # Larger files (fewer files, longer compaction)
# target_file_size_mb = 256 # Smaller files (more files, faster compaction)

Concurrent Jobs

[compaction]
max_concurrent_jobs = 2 # Run 2 compactions in parallel (default)
# max_concurrent_jobs = 4 # More parallelism (use more CPU/memory)
# max_concurrent_jobs = 1 # Sequential (lower resource usage)

Compression

[compaction]
compression = "zstd" # Best compression (default)
compression_level = 3 # Balance speed vs compression (default)

# Options:
# compression = "snappy" # Fastest, lower compression
# compression = "gzip" # Good compression, slower
# compression = "zstd" # Best compression, good speed

Disable Compaction

[compaction]
enabled = false

When to disable:

  • Testing ingestion performance
  • Very low write volume (<10 files/hour)
  • Debugging compaction issues
Warning: Disabling compaction will cause queries to slow down significantly as small files accumulate.

Monitoring

Check Compaction Status

curl http://localhost:8000/api/compaction/status \
-H "Authorization: Bearer YOUR_TOKEN"

Response:

{
  "enabled": true,
  "running": false,
  "last_run": "2025-10-08T14:05:00Z",
  "next_run": "2025-10-08T15:05:00Z",
  "stats": {
    "total_jobs": 42,
    "successful_jobs": 40,
    "failed_jobs": 2,
    "total_files_compacted": 12580,
    "total_bytes_saved": 8589934592
  }
}
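
The fields above can be piped through jq for a quick health summary, for example:

# Summarize compaction health from the status response above
curl -s http://localhost:8000/api/compaction/status \
  -H "Authorization: Bearer YOUR_TOKEN" \
  | jq '{running, failed_jobs: .stats.failed_jobs, gb_saved: (.stats.total_bytes_saved / 1e9)}'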

Get Detailed Statistics

curl http://localhost:8000/api/compaction/stats \
-H "Authorization: Bearer YOUR_TOKEN"

List Eligible Partitions

curl http://localhost:8000/api/compaction/candidates \
-H "Authorization: Bearer YOUR_TOKEN"

Response:

{
  "candidates": [
    {
      "partition": "default/cpu/2025/10/08/14",
      "file_count": 150,
      "total_size_mb": 7500,
      "age_hours": 2.5,
      "eligible": true
    },
    {
      "partition": "default/mem/2025/10/08/14",
      "file_count": 120,
      "total_size_mb": 6000,
      "age_hours": 2.5,
      "eligible": true
    }
  ],
  "total_candidates": 2
}
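
A jq filter over this response makes it easy to spot the partitions with the most files, for example:

# List candidate partitions, largest file counts first
curl -s http://localhost:8000/api/compaction/candidates \
  -H "Authorization: Bearer YOUR_TOKEN" \
  | jq -r '.candidates | sort_by(-.file_count) | .[] | "\(.partition)\t\(.file_count) files\t\(.total_size_mb) MB"'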

Manually Trigger Compaction

curl -X POST http://localhost:8000/api/compaction/trigger \
-H "Authorization: Bearer YOUR_TOKEN"

View Active Jobs

curl http://localhost:8000/api/compaction/jobs \
-H "Authorization: Bearer YOUR_TOKEN"

View Job History

curl http://localhost:8000/api/compaction/history \
-H "Authorization: Bearer YOUR_TOKEN"

Performance Impact

Compaction Performance

Test Environment: Apple M3 Max (14 cores, 36GB RAM)

Files   Size     Compaction Time   Final Size   Compression
888     1.2 GB   2.1s              239 MB       80.3%
906     1.2 GB   2.2s              242 MB       80.4%
910     1.2 GB   2.3s              243 MB       80.5%

Total: 2,704 files (3.7 GB) → 3 files (724 MB) in 6.6 seconds

Query Performance

Before Compaction:

SELECT * FROM cpu WHERE time > NOW() - INTERVAL 1 HOUR;
-- 5.2 seconds (scan 720 files)

After Compaction:

SELECT * FROM cpu WHERE time > NOW() - INTERVAL 1 HOUR;
-- 0.05 seconds (scan 1 file) - 104x faster!

Storage Savings

Original files (Snappy):  3.7 GB
Compacted files (ZSTD): 724 MB
Space saved: 80.4%

Best Practices

1. Let Compaction Run Automatically

The default schedule (hourly) works well for most use cases:

[compaction]
enabled = true
schedule = "5 * * * *"

2. Monitor Compaction Jobs

Set up alerts for the following (a sample check script follows the list):

  • Failed compaction jobs
  • Partitions with >1000 files
  • Compaction taking >10 minutes
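
A minimal sketch of such a check, suitable for cron, based on the /status and /candidates responses shown earlier (the token and thresholds are placeholders):

#!/usr/bin/env bash
# Exit non-zero if any compaction jobs failed or a partition exceeds 1000 files
set -euo pipefail
TOKEN="YOUR_TOKEN"
BASE="http://localhost:8000/api/compaction"

failed=$(curl -s "$BASE/status" -H "Authorization: Bearer $TOKEN" | jq '.stats.failed_jobs')
big=$(curl -s "$BASE/candidates" -H "Authorization: Bearer $TOKEN" \
      | jq '[.candidates[] | select(.file_count > 1000)] | length')

if [ "$failed" -gt 0 ] || [ "$big" -gt 0 ]; then
  echo "ALERT: $failed failed compaction jobs, $big partitions with >1000 files"
  exit 1
fi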

3. Adjust Based on Write Volume

High volume (>10M records/sec):

[compaction]
min_files = 100 # Wait for more files
max_concurrent_jobs = 4 # More parallelism

Low volume (<100K records/sec):

[compaction]
min_files = 5 # Compact with fewer files
schedule = "0 */6 * * *" # Every 6 hours

4. Use Appropriate Target File Size

[compaction]
target_file_size_mb = 512 # Good default
# target_file_size_mb = 1024 # For very large datasets
# target_file_size_mb = 256 # For faster compaction

5. Reduce File Generation at Source

Best practice: Increase buffer sizes to generate fewer files:

[ingestion]
buffer_size = 200000 # Up from 50,000 (4x fewer files)
buffer_age_seconds = 10 # Up from 5 (2x fewer files)

Impact:

  • Files generated: 2,000/hour → 250/hour (8x reduction)
  • Compaction time: 150s → 20s (7x faster)
  • Memory usage: +300MB per worker

This is the most effective optimization - fewer files means faster compaction AND faster queries.

Troubleshooting

Compaction Not Running

Check status:

curl http://localhost:8000/api/compaction/status

Verify configuration:

# Check if enabled
grep "enabled" arc.conf

# Check schedule
grep "schedule" arc.conf

Check logs:

# Docker
docker-compose logs arc-api | grep compaction

# Native
tail -f logs/arc-api.log | grep compaction

Compaction Taking Too Long

Symptoms: Compaction jobs running for >30 minutes

Solutions:

  1. Reduce target file size:

    [compaction]
    target_file_size_mb = 256 # Smaller chunks
  2. Increase parallelism:

    [compaction]
    max_concurrent_jobs = 4
  3. Reduce files at source:

    [ingestion]
    buffer_size = 200000

Out of Disk Space During Compaction

Symptoms: Compaction fails with disk space errors

Solutions:

  1. Use temp directory on larger disk:

    export TMPDIR=/mnt/large-disk/tmp
  2. Reduce concurrent jobs:

    [compaction]
    max_concurrent_jobs = 1
  3. Clean up old compacted files manually:

    # Remove small files that were already compacted
    find ./data -name "*.parquet" -size -10M -delete

Compaction Locks Not Releasing

Symptoms: Partitions stuck in "locked" state

Check locks:

# View active locks
sqlite3 ./data/arc.db "SELECT * FROM compaction_locks;"

Clear stale locks:

# Locks expire automatically after 2 hours
# Or manually clear:
sqlite3 ./data/arc.db "DELETE FROM compaction_locks WHERE expires_at < datetime('now');"

API Reference

GET /api/compaction/status

Get current compaction status.

Response:

{
  "enabled": true,
  "running": false,
  "last_run": "2025-10-08T14:05:00Z",
  "next_run": "2025-10-08T15:05:00Z"
}

GET /api/compaction/stats

Get detailed compaction statistics.

GET /api/compaction/candidates

List partitions eligible for compaction.

POST /api/compaction/trigger

Manually trigger compaction.

Response:

{
  "message": "Compaction triggered",
  "job_id": "comp_1696775400"
}

GET /api/compaction/jobs

View active compaction jobs.

GET /api/compaction/history

View compaction job history.

Summary

Compaction is essential for production deployments:

Benefits:

  • 10-50x faster queries
  • 80% storage savings
  • 99% fewer API calls
  • Automatic and safe

Default configuration works for most cases:

[compaction]
enabled = true
schedule = "5 * * * *"
min_age_hours = 1
min_files = 10

Monitor regularly:

  • Check /api/compaction/status
  • Alert on failed jobs
  • Watch for partitions with >1000 files

Next Steps