# SQL Querying Guide
Arc uses DuckDB as its SQL engine, giving you full analytical SQL capabilities on data stored as Parquet files.
## SQL Syntax

Queries use the format `database.measurement` as the table name:

```sql
SELECT * FROM mydb.cpu LIMIT 10
```

If your database is named `default`, you can omit it:

```sql
SELECT * FROM cpu LIMIT 10
```
## Query Endpoints

| Endpoint | Response Format | Best For |
|---|---|---|
| `POST /api/v1/query` | JSON | Small results, debugging, dashboards |
| `POST /api/v1/query/arrow` | Apache Arrow IPC | Large results (6M+ rows/sec) |
| `GET /api/v1/query/:measurement` | JSON | Quick measurement queries |
### JSON Query

```bash
curl -X POST "http://localhost:8000/api/v1/query" \
  -H "Authorization: Bearer $ARC_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"sql": "SELECT * FROM default.cpu WHERE time > NOW() - INTERVAL '\''1 hour'\'' LIMIT 100"}'
```
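The same request can be made from Python with only the standard library. A minimal sketch, assuming the endpoint and response shape shown in the curl example above; the helper names are illustrative, not part of Arc:

```python
import json
import urllib.request

ARC_URL = "http://localhost:8000/api/v1/query"  # assumed local Arc instance


def build_payload(sql: str) -> bytes:
    """Encode the query as the JSON body the endpoint expects."""
    return json.dumps({"sql": sql}).encode("utf-8")


def query_json(sql: str, token: str, url: str = ARC_URL) -> dict:
    """POST a SQL query with a bearer token and decode the JSON response."""
    req = urllib.request.Request(
        url,
        data=build_payload(sql),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Example (requires a running Arc instance and a valid token):
# rows = query_json("SELECT * FROM default.cpu LIMIT 100", token="my-token")
```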
### Arrow Query

For large result sets, Arrow IPC provides roughly 2x the throughput of JSON:

```bash
curl -X POST "http://localhost:8000/api/v1/query/arrow" \
  -H "Authorization: Bearer $ARC_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"sql": "SELECT * FROM default.cpu LIMIT 1000000"}' \
  -o results.arrow
```
## Time Filtering

Arc stores timestamps in the `time` column. Use standard SQL intervals for time-range queries:

```sql
-- Last hour
SELECT * FROM default.cpu
WHERE time > NOW() - INTERVAL '1 hour';

-- Last 7 days
SELECT * FROM default.cpu
WHERE time > NOW() - INTERVAL '7 days';

-- Specific date range
SELECT * FROM default.cpu
WHERE time BETWEEN '2026-01-01' AND '2026-01-31';
```

Time-range filters on the `time` column automatically trigger partition pruning, which skips Parquet files outside the range entirely. Always include a time filter for best performance.
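When generating date-range queries programmatically, format explicit timestamp literals rather than concatenating raw user input. A small sketch; the helper name is illustrative and not part of Arc:

```python
from datetime import datetime, timedelta


def time_range_filter(start: datetime, end: datetime) -> str:
    """Render a BETWEEN predicate on the time column as ISO-style literals."""
    return (
        f"time BETWEEN '{start.strftime('%Y-%m-%d %H:%M:%S')}' "
        f"AND '{end.strftime('%Y-%m-%d %H:%M:%S')}'"
    )


# Build a 30-day range ending on a fixed date
end = datetime(2026, 1, 31)
start = end - timedelta(days=30)
sql = f"SELECT * FROM default.cpu WHERE {time_range_filter(start, end)}"
```

Explicit literals keep the query deterministic (unlike `NOW()`-relative filters), which matters when the same query is re-run for backfills or comparisons.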
## Time-Series Aggregation

### time_bucket

Group data into fixed-size time intervals:

```sql
-- Hourly averages for the last 7 days
SELECT
    time_bucket('1 hour', time) AS bucket,
    AVG(cpu_usage) AS avg_cpu,
    MAX(cpu_usage) AS max_cpu,
    COUNT(*) AS samples
FROM default.cpu
WHERE time > NOW() - INTERVAL '7 days'
GROUP BY bucket
ORDER BY bucket;
```
### date_trunc

Truncate timestamps to calendar boundaries:

```sql
-- Daily summary for the last 30 days
SELECT
    date_trunc('day', time) AS day,
    host,
    AVG(cpu_usage) AS avg_cpu,
    AVG(mem_usage) AS avg_mem
FROM default.cpu
WHERE time > NOW() - INTERVAL '30 days'
GROUP BY day, host
ORDER BY day DESC, host;
```
## Window Functions

Compute rolling metrics and detect anomalies:

```sql
-- 10-minute moving average with anomaly detection
SELECT
    time,
    host,
    cpu_usage,
    AVG(cpu_usage) OVER (
        PARTITION BY host
        ORDER BY time
        ROWS BETWEEN 10 PRECEDING AND CURRENT ROW
    ) AS moving_avg,
    cpu_usage - AVG(cpu_usage) OVER (
        PARTITION BY host
        ORDER BY time
        ROWS BETWEEN 60 PRECEDING AND CURRENT ROW
    ) AS deviation
FROM default.cpu
WHERE time > NOW() - INTERVAL '1 hour';
```
## Common Table Expressions (CTEs)

Break complex queries into readable steps:

```sql
-- Find hosts with anomalous CPU spikes
WITH hourly_stats AS (
    SELECT
        host,
        time_bucket('1 hour', time) AS bucket,
        AVG(cpu_usage) AS avg_cpu,
        STDDEV(cpu_usage) AS std_cpu
    FROM default.cpu
    WHERE time > NOW() - INTERVAL '24 hours'
    GROUP BY host, bucket
),
anomalies AS (
    SELECT *
    FROM hourly_stats
    WHERE avg_cpu > 80 OR std_cpu > 20
)
SELECT host, bucket, avg_cpu, std_cpu
FROM anomalies
ORDER BY avg_cpu DESC;
```
## Cross-Database Queries

Join data across databases and measurements using fully qualified names:

```sql
-- Join CPU metrics with deployment events
SELECT
    c.time,
    c.host,
    c.cpu_usage,
    d.version
FROM production.cpu c
JOIN production.deployments d
    ON c.host = d.host
    AND c.time BETWEEN d.time AND d.time + INTERVAL '1 hour'
WHERE c.time > NOW() - INTERVAL '24 hours';
```
## Useful DuckDB Functions

Arc supports all DuckDB functions. These are the most useful for analytical queries:

| Function | Description | Example |
|---|---|---|
| `NOW()` | Current timestamp | `WHERE time > NOW() - INTERVAL '1 hour'` |
| `time_bucket(interval, time)` | Fixed-size time buckets | `time_bucket('5 minutes', time)` |
| `date_trunc(part, time)` | Calendar truncation | `date_trunc('day', time)` |
| `epoch(time)` | Timestamp to epoch seconds | `epoch(time)` |
| `PERCENTILE_CONT(p)` | Percentile (continuous) | `PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY latency)` |
| `APPROX_QUANTILE(col, p)` | Approximate percentile (faster) | `APPROX_QUANTILE(latency, 0.99)` |
| `STDDEV(col)` | Standard deviation | `STDDEV(cpu_usage)` |
| `LAG(col) OVER (...)` | Previous row value | `LAG(value) OVER (ORDER BY time)` |
| `LEAD(col) OVER (...)` | Next row value | `LEAD(value) OVER (ORDER BY time)` |
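`LAG` is the usual way to turn a monotonically increasing counter into per-sample deltas. A small sketch of the semantics, using Python's bundled SQLite (which supports the same `LAG` window function) as a stand-in for DuckDB; the `net` table and `bytes_recv` values are invented sample data, and in Arc the same window clause would run server-side through the query endpoint:

```python
import sqlite3

# In-memory SQLite database with a tiny counter-style table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE net (time TEXT, bytes_recv INTEGER)")
conn.executemany(
    "INSERT INTO net VALUES (?, ?)",
    [("t1", 100), ("t2", 250), ("t3", 400)],
)

# Delta between each sample and the previous one, ordered by time.
rows = conn.execute(
    """
    SELECT time,
           bytes_recv - LAG(bytes_recv) OVER (ORDER BY time) AS delta
    FROM net
    """
).fetchall()
# rows -> [('t1', None), ('t2', 150), ('t3', 150)]
```

The first row has no predecessor, so its delta is `NULL`; wrap the expression in `COALESCE(..., 0)` if you need a numeric value everywhere.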
Performance Tips
-
Always filter by time -- Partition pruning skips entire Parquet files outside the range, often 10-100x faster.
-
Use Arrow for large results -- Arrow IPC provides ~2x throughput vs JSON for result sets over 100K rows.
-
Limit result sets -- Add
LIMITwhen exploring data. Scanning millions of rows without a limit is expensive. -
Use aggregations server-side -- Compute
AVG,COUNT,SUMin SQL rather than fetching raw rows and aggregating client-side. -
Prefer
APPROX_QUANTILEoverPERCENTILE_CONT-- For large datasets, approximate percentiles are 10-100x faster. -
Use
time_bucketoverdate_trunc--time_bucketsupports arbitrary intervals (5 min, 15 min, 4 hours) whiledate_truncis limited to calendar boundaries.
## Next Steps
- API Reference -- Full endpoint documentation
- Python SDK Querying -- Query with pandas, polars, and PyArrow
- Retention Policies -- Automatic data expiration
- Continuous Queries -- Real-time aggregations and downsampling