Skip to main content

Query Caching

Arc implements multiple caching layers to optimize query performance, particularly for dashboard and monitoring use cases where the same queries are executed repeatedly.

Cache Layers

Arc uses three complementary caches that work together:

CacheTTLPurposeSavings
SQL Transform Cache60sCaches SQL-to-storage-path conversions49-104x speedup
Partition Path Cache60sCaches OptimizeTablePath() results50-100ms/query
Glob Cache30sCaches filesystem glob results5-10ms/query

SQL Transform Cache

When you execute a query like:

SELECT * FROM mydb.cpu WHERE time > now() - INTERVAL '1 hour'

Arc converts the table reference mydb.cpu to a DuckDB read_parquet() call:

SELECT * FROM read_parquet('./data/mydb/cpu/**/*.parquet') WHERE time > now() - INTERVAL '1 hour'

This string transformation uses regex matching and happens on every query. The SQL Transform Cache stores the result so repeated queries skip this processing.

Performance Impact

ScenarioTimeSpeedup
Without cache (first query)13-37μs-
With cache (repeated query)~300ns49-104x

When It Helps

The SQL Transform Cache is most beneficial for:

  • Dashboard refresh: Same queries every 30s-5min
  • Monitoring alerts: Repeated threshold checks
  • API integrations: Clients polling the same metrics
  • Multi-user dashboards: Shared queries across users

Cache Behavior

  • Key: SHA256 hash of the raw SQL string
  • TTL: 60 seconds (matches partition cache)
  • Max entries: 10,000 queries
  • Eviction: Expired entries removed first, then oldest

Partition Path Cache

After SQL transformation, Arc optimizes the storage path by applying time-based partition pruning. This cache stores the optimized paths.

Example

Query with time filter:

SELECT * FROM mydb.cpu WHERE time > 1704067200000000

Without cache: Scans partition metadata to find relevant directories. With cache: Returns pre-computed path like ./data/mydb/cpu/2024/01/**/*.parquet.

Performance Impact

Saves 50-100ms per query on large datasets with many partitions.

Glob Cache

After determining the partition path, Arc uses filesystem globs to find matching Parquet files. The Glob Cache stores these file listings.

Performance Impact

Saves 5-10ms per query by avoiding repeated filesystem operations.

Cache Statistics

Monitor cache performance via the pruner stats:

stats := pruner.GetAllCacheStats()

Returns:

{
"partition_cache": {
"size": 150,
"hits": 12847,
"misses": 423,
"hit_rate_percent": 96.8
},
"glob_cache": {
"size": 89,
"hits": 8234,
"misses": 312,
"hit_rate_percent": 96.3
}
}

Best Practices

1. Use Consistent Query Strings

Cache keys are based on exact SQL text. These are different cache entries:

SELECT * FROM mydb.cpu WHERE time > 1704067200000000
SELECT * FROM mydb.cpu WHERE time > 1704067200000000 -- extra space
select * from mydb.cpu where time > 1704067200000000 -- lowercase

Normalize your queries for better cache hit rates.

2. Use Parameterized Time Ranges

For dashboard queries, use relative time:

-- Good: Same query text on every refresh
SELECT * FROM mydb.cpu WHERE time > now() - INTERVAL '1 hour'

-- Less efficient: Different timestamp each time
SELECT * FROM mydb.cpu WHERE time > 1704067200000000

3. Monitor Hit Rates

Healthy dashboards should see 60-80%+ cache hit rates. Low hit rates may indicate:

  • Too many unique queries
  • Query text variations
  • TTL too short for your refresh interval

Configuration

Cache parameters are currently fixed but tuned for typical workloads:

ParameterValueRationale
SQL Transform TTL60sCovers 1-2 dashboard refresh cycles
SQL Transform Max Size10,000Handles large multi-tenant deployments
Partition Cache TTL60sBalance freshness vs. performance
Glob Cache TTL30sFiles change less frequently

Cache Invalidation

Caches automatically expire based on TTL. Manual invalidation happens when:

  • New data is ingested (invalidates partition/glob caches for affected measurements)
  • Compaction runs (file paths change)

The SQL Transform Cache is not invalidated by data changes since the transformation logic doesn't depend on data content.

Technical Details

Thread Safety

All caches use sync.RWMutex for concurrent access:

  • Multiple readers allowed
  • Exclusive write lock for updates
  • Lock-free atomic counters for hit/miss tracking

Memory Usage

Approximate memory per cache:

CacheEntry SizeMax EntriesMax Memory
SQL Transform~500 bytes10,000~5MB
Partition Path~200 bytes1,000~200KB
Glob~1KB1,000~1MB

Total cache overhead: ~6MB typical, ~10MB maximum.

Next Steps