Query Caching

Arc implements multiple caching layers to optimize query performance, particularly for dashboard and monitoring use cases where the same queries are executed repeatedly.

Cache Layers

Arc uses three complementary caches that work together:

Cache	TTL	Purpose	Savings
SQL Transform Cache	60s	Caches SQL-to-storage-path conversions	49-104x speedup
Partition Path Cache	60s	Caches `OptimizeTablePath()` results	50-100ms/query
Glob Cache	30s	Caches filesystem glob results	5-10ms/query

SQL Transform Cache

When you execute a query like:

SELECT * FROM mydb.cpu WHERE time > now() - INTERVAL '1 hour'

Arc converts the table reference mydb.cpu to a DuckDB read_parquet() call:

SELECT * FROM read_parquet('./data/mydb/cpu/**/*.parquet') WHERE time > now() - INTERVAL '1 hour'

This string transformation uses regex matching and happens on every query. The SQL Transform Cache stores the result so repeated queries skip this processing.

Performance Impact

Scenario	Time	Speedup
Without cache (first query)	13-37μs	-
With cache (repeated query)	~300ns	49-104x

When It Helps

The SQL Transform Cache is most beneficial for:

Dashboard refresh: Same queries every 30s-5min
Monitoring alerts: Repeated threshold checks
API integrations: Clients polling the same metrics
Multi-user dashboards: Shared queries across users

Cache Behavior

Key: SHA256 hash of the raw SQL string
TTL: 60 seconds (matches partition cache)
Max entries: 10,000 queries
Eviction: Expired entries removed first, then oldest

Partition Path Cache

After SQL transformation, Arc optimizes the storage path by applying time-based partition pruning. This cache stores the optimized paths.

Example

Query with time filter:

SELECT * FROM mydb.cpu WHERE time > 1704067200000000

Without cache: Scans partition metadata to find relevant directories. With cache: Returns pre-computed path like ./data/mydb/cpu/2024/01/**/*.parquet.

Performance Impact

Saves 50-100ms per query on large datasets with many partitions.

Glob Cache

After determining the partition path, Arc uses filesystem globs to find matching Parquet files. The Glob Cache stores these file listings.

Performance Impact

Saves 5-10ms per query by avoiding repeated filesystem operations.

Cache Statistics

Monitor cache performance via the pruner stats:

stats := pruner.GetAllCacheStats()

Returns:

{
  "partition_cache": {
    "size": 150,
    "hits": 12847,
    "misses": 423,
    "hit_rate_percent": 96.8
  },
  "glob_cache": {
    "size": 89,
    "hits": 8234,
    "misses": 312,
    "hit_rate_percent": 96.3
  }
}

Best Practices

1. Use Consistent Query Strings

Cache keys are based on exact SQL text. These are different cache entries:

SELECT * FROM mydb.cpu WHERE time > 1704067200000000
SELECT *  FROM mydb.cpu WHERE time > 1704067200000000  -- extra space
select * from mydb.cpu where time > 1704067200000000   -- lowercase

Normalize your queries for better cache hit rates.

2. Use Parameterized Time Ranges

For dashboard queries, use relative time:

-- Good: Same query text on every refresh
SELECT * FROM mydb.cpu WHERE time > now() - INTERVAL '1 hour'

-- Less efficient: Different timestamp each time
SELECT * FROM mydb.cpu WHERE time > 1704067200000000

3. Monitor Hit Rates

Healthy dashboards should see 60-80%+ cache hit rates. Low hit rates may indicate:

Too many unique queries
Query text variations
TTL too short for your refresh interval

Configuration

Cache parameters are currently fixed but tuned for typical workloads:

Parameter	Value	Rationale
SQL Transform TTL	60s	Covers 1-2 dashboard refresh cycles
SQL Transform Max Size	10,000	Handles large multi-tenant deployments
Partition Cache TTL	60s	Balance freshness vs. performance
Glob Cache TTL	30s	Files change less frequently

Cache Invalidation

Caches automatically expire based on TTL. Manual invalidation happens when:

New data is ingested (invalidates partition/glob caches for affected measurements)
Compaction runs (file paths change)

The SQL Transform Cache is not invalidated by data changes since the transformation logic doesn't depend on data content.

Technical Details

Thread Safety

All caches use sync.RWMutex for concurrent access:

Multiple readers allowed
Exclusive write lock for updates
Lock-free atomic counters for hit/miss tracking

Memory Usage

Approximate memory per cache:

Cache	Entry Size	Max Entries	Max Memory
SQL Transform	~500 bytes	10,000	~5MB
Partition Path	~200 bytes	1,000	~200KB
Glob	~1KB	1,000	~1MB

Total cache overhead: ~6MB typical, ~10MB maximum.

Next Steps

Compaction - Optimize query performance through file merging
WAL - Write-ahead log for durability
Performance Benchmarks - Benchmark methodology and results

Cache Layers​

SQL Transform Cache​

Performance Impact​

When It Helps​

Cache Behavior​

Partition Path Cache​

Example​

Performance Impact​

Glob Cache​

Performance Impact​

Cache Statistics​

Best Practices​

1. Use Consistent Query Strings​

2. Use Parameterized Time Ranges​

3. Monitor Hit Rates​

Configuration​

Cache Invalidation​

Technical Details​

Thread Safety​

Memory Usage​

Next Steps​

Cache Layers

SQL Transform Cache

Performance Impact

When It Helps

Cache Behavior

Partition Path Cache

Example

Performance Impact

Glob Cache

Performance Impact

Cache Statistics

Best Practices

1. Use Consistent Query Strings

2. Use Parameterized Time Ranges

3. Monitor Hit Rates

Configuration

Cache Invalidation

Technical Details

Thread Safety

Memory Usage

Next Steps