# SQL Querying Guide
Arc uses DuckDB as its SQL engine, giving you full analytical SQL capabilities on data stored as Parquet files.
## SQL Syntax

Queries use the format `database.measurement` as the table name:

```sql
SELECT * FROM mydb.cpu LIMIT 10
```

If your database is named `default`, you can omit it:

```sql
SELECT * FROM cpu LIMIT 10
```
## Query Endpoints

| Endpoint | Response Format | Best For |
|---|---|---|
| `POST /api/v1/query` | JSON | Small results, debugging, dashboards |
| `POST /api/v1/query/arrow` | Apache Arrow IPC | Large results (6M+ rows/sec) |
| `GET /api/v1/query/:measurement` | JSON | Quick measurement queries |
### JSON Query

```bash
curl -X POST "http://localhost:8000/api/v1/query" \
  -H "Authorization: Bearer $ARC_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"sql": "SELECT * FROM default.cpu WHERE time > NOW() - INTERVAL '\''1 hour'\'' LIMIT 100"}'
```
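The same request can be made from Python with only the standard library. A minimal sketch, assuming the endpoint and response shape shown in the curl example above; the helper names are illustrative, not part of Arc:

```python
import json
import urllib.request

ARC_URL = "http://localhost:8000/api/v1/query"  # assumed local Arc instance


def build_payload(sql: str) -> bytes:
    """Encode the query as the JSON body the endpoint expects."""
    return json.dumps({"sql": sql}).encode("utf-8")


def query_json(sql: str, token: str, url: str = ARC_URL) -> dict:
    """POST a SQL query with a bearer token and decode the JSON response."""
    req = urllib.request.Request(
        url,
        data=build_payload(sql),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Example (requires a running Arc instance and a valid token):
# rows = query_json("SELECT * FROM default.cpu LIMIT 100", token="my-token")
```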
### Arrow Query

For large result sets, Arrow IPC provides roughly 2x the throughput of JSON:

```bash
curl -X POST "http://localhost:8000/api/v1/query/arrow" \
  -H "Authorization: Bearer $ARC_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"sql": "SELECT * FROM default.cpu LIMIT 1000000"}' \
  -o results.arrow
```
## Time Filtering

Arc stores timestamps in the `time` column. Use standard SQL intervals for time-range queries:

```sql
-- Last hour
SELECT * FROM default.cpu
WHERE time > NOW() - INTERVAL '1 hour';

-- Last 7 days
SELECT * FROM default.cpu
WHERE time > NOW() - INTERVAL '7 days';

-- Specific date range
SELECT * FROM default.cpu
WHERE time BETWEEN '2026-01-01' AND '2026-01-31';
```

Time-range filters on the `time` column automatically trigger partition pruning, which skips Parquet files outside the range entirely. Always include a time filter for best performance.
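When generating date-range queries programmatically, format explicit timestamp literals rather than concatenating raw user input. A small sketch; the helper name is illustrative and not part of Arc:

```python
from datetime import datetime, timedelta


def time_range_filter(start: datetime, end: datetime) -> str:
    """Render a BETWEEN predicate on the time column as ISO-style literals."""
    return (
        f"time BETWEEN '{start.strftime('%Y-%m-%d %H:%M:%S')}' "
        f"AND '{end.strftime('%Y-%m-%d %H:%M:%S')}'"
    )


# Build a 30-day range ending on a fixed date
end = datetime(2026, 1, 31)
start = end - timedelta(days=30)
sql = f"SELECT * FROM default.cpu WHERE {time_range_filter(start, end)}"
```

Explicit literals keep the query deterministic (unlike `NOW()`-relative filters), which matters when the same query is re-run for backfills or comparisons.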
## Time-Series Aggregation

### time_bucket

Group data into fixed-size time intervals:

```sql
-- Hourly averages for the last 7 days
SELECT
    time_bucket('1 hour', time) AS bucket,
    AVG(cpu_usage) AS avg_cpu,
    MAX(cpu_usage) AS max_cpu,
    COUNT(*) AS samples
FROM default.cpu
WHERE time > NOW() - INTERVAL '7 days'
GROUP BY bucket
ORDER BY bucket;
```
### date_trunc

Truncate timestamps to calendar boundaries:

```sql
-- Daily summary for the last 30 days
SELECT
    date_trunc('day', time) AS day,
    host,
    AVG(cpu_usage) AS avg_cpu,
    AVG(mem_usage) AS avg_mem
FROM default.cpu
WHERE time > NOW() - INTERVAL '30 days'
GROUP BY day, host
ORDER BY day DESC, host;
```
## Window Functions

Compute rolling metrics and detect anomalies:

```sql
-- 10-minute moving average with anomaly detection
SELECT
    time,
    host,
    cpu_usage,
    AVG(cpu_usage) OVER (
        PARTITION BY host
        ORDER BY time
        ROWS BETWEEN 10 PRECEDING AND CURRENT ROW
    ) AS moving_avg,
    cpu_usage - AVG(cpu_usage) OVER (
        PARTITION BY host
        ORDER BY time
        ROWS BETWEEN 60 PRECEDING AND CURRENT ROW
    ) AS deviation
FROM default.cpu
WHERE time > NOW() - INTERVAL '1 hour';
```
## Common Table Expressions (CTEs)

Break complex queries into readable steps:

```sql
-- Find hosts with anomalous CPU spikes
WITH hourly_stats AS (
    SELECT
        host,
        time_bucket('1 hour', time) AS bucket,
        AVG(cpu_usage) AS avg_cpu,
        STDDEV(cpu_usage) AS std_cpu
    FROM default.cpu
    WHERE time > NOW() - INTERVAL '24 hours'
    GROUP BY host, bucket
),
anomalies AS (
    SELECT *
    FROM hourly_stats
    WHERE avg_cpu > 80 OR std_cpu > 20
)
SELECT host, bucket, avg_cpu, std_cpu
FROM anomalies
ORDER BY avg_cpu DESC;
```
## Cross-Database Queries

Join data across databases and measurements using fully qualified names:

```sql
-- Join CPU metrics with deployment events
SELECT
    c.time,
    c.host,
    c.cpu_usage,
    d.version
FROM production.cpu c
JOIN production.deployments d
    ON c.host = d.host
    AND c.time BETWEEN d.time AND d.time + INTERVAL '1 hour'
WHERE c.time > NOW() - INTERVAL '24 hours';
```
## Useful DuckDB Functions

Arc supports all DuckDB functions. These are the most useful for analytical queries:

| Function | Description | Example |
|---|---|---|
| `NOW()` | Current timestamp | `WHERE time > NOW() - INTERVAL '1 hour'` |
| `time_bucket(interval, time)` | Fixed-size time buckets | `time_bucket('5 minutes', time)` |
| `date_trunc(part, time)` | Calendar truncation | `date_trunc('day', time)` |
| `epoch(time)` | Timestamp to epoch seconds | `epoch(time)` |
| `PERCENTILE_CONT(p)` | Percentile (continuous) | `PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY latency)` |
| `APPROX_QUANTILE(col, p)` | Approximate percentile (faster) | `APPROX_QUANTILE(latency, 0.99)` |
| `STDDEV(col)` | Standard deviation | `STDDEV(cpu_usage)` |
| `LAG(col) OVER (...)` | Previous row value | `LAG(value) OVER (ORDER BY time)` |
| `LEAD(col) OVER (...)` | Next row value | `LEAD(value) OVER (ORDER BY time)` |
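`LAG` is the usual way to turn a monotonically increasing counter into per-sample deltas. A small sketch of the semantics, using Python's bundled SQLite (which supports the same `LAG` window function) as a stand-in for DuckDB; the `net` table and `bytes_recv` values are invented sample data, and in Arc the same window clause would run server-side through the query endpoint:

```python
import sqlite3

# In-memory SQLite database with a tiny counter-style table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE net (time TEXT, bytes_recv INTEGER)")
conn.executemany(
    "INSERT INTO net VALUES (?, ?)",
    [("t1", 100), ("t2", 250), ("t3", 400)],
)

# Delta between each sample and the previous one, ordered by time.
rows = conn.execute(
    """
    SELECT time,
           bytes_recv - LAG(bytes_recv) OVER (ORDER BY time) AS delta
    FROM net
    """
).fetchall()
# rows -> [('t1', None), ('t2', 150), ('t3', 150)]
```

The first row has no predecessor, so its delta is `NULL`; wrap the expression in `COALESCE(..., 0)` if you need a numeric value everywhere.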
Performance Tips
-
Always filter by time -- Partition pruning skips entire Parquet files outside the range, often 10-100x faster.
-
Use Arrow for large results -- Arrow IPC provides ~2x throughput vs JSON for result sets over 100K rows.
-
Limit result sets -- Add
LIMITwhen exploring data. Scanning millions of rows without a limit is expensive. -
Use aggregations server-side -- Compute
AVG,COUNT,SUMin SQL rather than fetching raw rows and aggregating client-side. -
Prefer
APPROX_QUANTILEoverPERCENTILE_CONT-- For large datasets, approximate percentiles are 10-100x faster. -
Use
time_bucketoverdate_trunc--time_bucketsupports arbitrary intervals (5 min, 15 min, 4 hours) whiledate_truncis limited to calendar boundaries.
## Next Steps
- API Reference -- Full endpoint documentation
- Python SDK Querying -- Query with pandas, polars, and PyArrow
- Retention Policies -- Automatic data expiration
- Continuous Queries -- Real-time aggregations and downsampling