Continuous Queries in Arc: Automatic Downsampling That Just Works

You're ingesting CPU metrics every second from 500 hosts. That's 43 million rows per day. Your Grafana dashboard shows hourly averages. Every time someone opens that dashboard, DuckDB scans all 43 million rows just to compute 12,000 aggregated data points.
That's wasteful. Continuous queries fix it.
What Is a Continuous Query?
A continuous query (CQ) is a SQL query that runs on a schedule, reads from a source measurement, and writes pre-computed results to a destination measurement. You define the aggregation once, and Arc runs it automatically — every minute, every hour, whatever interval you choose.
cpu (raw, 1-second resolution, 43M rows/day)
↓ CQ runs every hour
cpu_hourly (aggregated, 1-hour resolution, 12K rows/day)
Your dashboards query cpu_hourly instead of cpu. Queries that used to scan millions of rows now scan thousands. The raw data stays around for ad-hoc deep dives, while the aggregated data serves 99% of operational monitoring.
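Concretely, the dashboard query changes from an aggregation over raw rows to a plain scan of the rollup. Here is a sketch; the FROM names stand in for however your query layer exposes the two measurements, and the column names follow the CPU example later in this post:
-- Before: every dashboard load aggregates raw 1-second samples
SELECT date_trunc('hour', epoch_us(time)) AS time, host, AVG(usage_idle) AS avg_idle
FROM cpu
WHERE time >= '2026-04-08T00:00:00Z'
GROUP BY 1, 2;
-- After: the same panel reads rows the CQ already wrote
SELECT time, host, avg_idle
FROM cpu_hourly
WHERE time >= '2026-04-08T00:00:00Z';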
If you've used InfluxDB's continuous queries or TimescaleDB's continuous aggregates, this is the same concept — adapted for Arc's Parquet-based architecture.
Creating a Continuous Query
Here's a real example. We want hourly CPU averages grouped by host:
curl -X POST http://localhost:8000/api/v1/continuous_queries \
-H "Authorization: Bearer $ADMIN_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"name": "cpu_hourly_avg",
"database": "production",
"source_measurement": "cpu",
"destination_measurement": "cpu_hourly",
"query": "SELECT date_trunc('\''hour'\'', epoch_us(time)) AS time, host, AVG(usage_idle) AS avg_idle, AVG(usage_system) AS avg_system, MAX(usage_idle) AS max_idle, MIN(usage_idle) AS min_idle, COUNT(*) AS samples FROM {database}.{source_measurement} WHERE time >= {start_time} AND time < {end_time} GROUP BY 1, 2",
"interval": "1h",
"is_active": true
}'
Let's break down what's happening here:
- source_measurement: Where the raw data lives (cpu)
- destination_measurement: Where the aggregated results go (cpu_hourly)
- query: Standard DuckDB SQL with four special placeholders
- interval: How often the query runs (1h, 5m, 30s, etc.)
The four placeholders — {database}, {source_measurement}, {start_time}, {end_time} — are replaced at execution time. The first two resolve to the correct Parquet file paths. The last two define the time window being processed.
You write SQL. Arc handles the plumbing.
The Time Window Machine
Here's the part that makes CQs reliable: Arc tracks what it's already processed.
Every CQ has a last_processed_time field. When a CQ runs, it uses this as the start of the time window:
Execution 1: start_time = (1 hour ago) → end_time = now
last_processed_time saved as end_time
Execution 2: start_time = last_processed_time → end_time = now
last_processed_time updated again
Execution 3: start_time = last_processed_time → end_time = now
...and so on
This means:
- No gaps: Each execution picks up exactly where the last one left off
- No overlap: The start of the current window is the end of the previous one
- Crash-safe: If Arc restarts, the next execution resumes from last_processed_time
The state update is atomic — the execution record and the last_processed_time update happen in a single SQLite transaction. If the server crashes between the DuckDB query and the state update, the worst case is re-processing one window, which produces the same aggregated result because the aggregation is idempotent.
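In SQL terms, the commit boundary looks roughly like the sketch below. The table and column names are illustrative, not Arc's actual schema:
-- Illustrative sketch of the atomic state update (schema names assumed)
BEGIN;
INSERT INTO cq_executions (cq_id, status, records_written, window_start, window_end)
VALUES (1, 'completed', 500, '2026-04-15T14:00:00Z', '2026-04-15T15:00:00Z');
UPDATE continuous_queries SET last_processed_time = '2026-04-15T15:00:00Z' WHERE id = 1;
COMMIT;
-- If the process dies before COMMIT, neither write lands,
-- and the next run re-processes the same window.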
What Happens Under the Hood
When a CQ executes, here's the actual sequence:
1. Time window calculation — Read last_processed_time from SQLite, set as start_time. Set end_time to now.
2. SQL transformation — Your query template gets the placeholders replaced:
-- Your template:
SELECT date_trunc('hour', epoch_us(time)) AS time, host, AVG(usage_idle) AS avg_idle
FROM {database}.{source_measurement}
WHERE time >= {start_time} AND time < {end_time}
GROUP BY 1, 2
-- Becomes:
SELECT date_trunc('hour', epoch_us(time)) AS time, host, AVG(usage_idle) AS avg_idle
FROM read_parquet('s3://bucket/production/cpu/**/*.parquet', union_by_name=true)
WHERE time >= '2026-04-15T14:00:00Z' AND time < '2026-04-15T15:00:00Z'
GROUP BY 1, 2
Notice that {database}.{source_measurement} gets replaced with a read_parquet() call pointing at the actual storage path — whether that's local disk, S3, or Azure Blob. The union_by_name=true handles schema evolution (if your source measurement gained new columns over time).
If your query uses CTEs (WITH ... AS), Arc detects them and doesn't accidentally replace CTE names with read_parquet() paths. You can write complex multi-step queries without worrying about collisions.
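For example, a template along these lines would work. The CTE name and the busy-CPU threshold are made up for illustration; Arc rewrites {database}.{source_measurement} but leaves the reference to busy alone:
-- Sketch: pre-filter in a CTE, then aggregate ('busy' is a CTE, not a measurement)
WITH busy AS (
  SELECT * FROM {database}.{source_measurement}
  WHERE time >= {start_time} AND time < {end_time}
    AND usage_idle < 20
)
SELECT date_trunc('hour', epoch_us(time)) AS time, host, COUNT(*) AS busy_samples
FROM busy
GROUP BY 1, 2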
3. DuckDB execution — The transformed query runs against DuckDB. This is a full analytical query — window functions, joins, subqueries, anything DuckDB supports.
4. Result ingestion — The query results are converted to columnar format and written to the destination measurement via Arc's ArrowBuffer. The data goes through the same ingestion pipeline as regular writes — buffered, flushed to Parquet, compacted later.
5. State update — In a single SQLite transaction: insert the execution record (status, duration, records written) and update last_processed_time.
Dry Run: Test Before You Commit
Complex aggregation queries deserve a test run. Dry run mode executes the query and shows you what would be written — without actually writing:
curl -X POST http://localhost:8000/api/v1/continuous_queries/1/execute \
-H "Authorization: Bearer $ADMIN_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"start_time": "2026-04-15T00:00:00Z",
"end_time": "2026-04-15T01:00:00Z",
"dry_run": true
}'
The response includes the fully-resolved SQL query that would have been executed, so you can copy it into your query tool and debug if needed.
Automatic Scheduling
Manual execution is useful for testing and backfills, but the real value is automatic scheduling. When a CQ is set to is_active: true with an interval, Arc's CQ scheduler picks it up:
CQ "cpu_hourly_avg" (interval: 1h)
→ Ticker fires every hour
→ Check license (enterprise feature)
→ Execute CQ with automatic time window
→ Log results to execution history
→ Repeat
The scheduler enforces a minimum interval of 10 seconds — we don't want someone accidentally setting interval: "100ms" and hammering their storage backend. Each execution has a 10-minute timeout to prevent runaway queries from blocking the scheduler.
If a CQ is updated (say you change the interval from 1h to 30m), the scheduler automatically restarts its ticker. No manual restart required.
Note: Automatic scheduling requires an enterprise license. Creating CQs and running them manually is available in the community edition.
Practical Examples
Downsampling IoT sensor data
Factory sensors report temperature every second. You want 5-minute averages for dashboards and 1-hour averages for weekly reports:
# 5-minute aggregation
curl -X POST http://localhost:8000/api/v1/continuous_queries \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{
"name": "temp_5m",
"database": "factory",
"source_measurement": "temperature",
"destination_measurement": "temperature_5m",
"query": "SELECT date_trunc('\''minute'\'', epoch_us(time)) - (EXTRACT(minute FROM epoch_us(time))::INTEGER % 5) * INTERVAL '\''1 minute'\'' AS time, sensor_id, zone, AVG(value) AS avg_temp, MAX(value) AS max_temp, MIN(value) AS min_temp FROM {database}.{source_measurement} WHERE time >= {start_time} AND time < {end_time} GROUP BY 1, 2, 3",
"interval": "5m",
"is_active": true
}'
The date_trunc / EXTRACT arithmetic in the query snaps each timestamp down to the start of its 5-minute bucket.
Pre-computing request latency percentiles
API monitoring with P50/P95/P99 percentiles, computed hourly:
curl -X POST http://localhost:8000/api/v1/continuous_queries \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{
"name": "latency_percentiles_1h",
"database": "monitoring",
"source_measurement": "requests",
"destination_measurement": "requests_hourly",
"query": "SELECT date_trunc('\''hour'\'', epoch_us(time)) AS time, endpoint, PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY latency_ms) AS p50, PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY latency_ms) AS p95, PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY latency_ms) AS p99, COUNT(*) AS total_requests FROM {database}.{source_measurement} WHERE time >= {start_time} AND time < {end_time} GROUP BY 1, 2",
"interval": "1h",
"is_active": true
}'
Because the query runs on DuckDB, you get the full power of a modern analytical engine — PERCENTILE_CONT, window functions, CASE WHEN, joins between measurements, anything you need.
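As one more sketch, here is an hourly error-rate rollup built with CASE WHEN. It assumes the requests measurement carries a status_code column, which the example above doesn't show:
-- Sketch: hourly error rate per endpoint (status_code column is assumed)
SELECT date_trunc('hour', epoch_us(time)) AS time, endpoint,
       COUNT(*) AS total_requests,
       SUM(CASE WHEN status_code >= 500 THEN 1 ELSE 0 END) AS server_errors,
       SUM(CASE WHEN status_code >= 500 THEN 1 ELSE 0 END) * 1.0 / COUNT(*) AS error_rate
FROM {database}.{source_measurement}
WHERE time >= {start_time} AND time < {end_time}
GROUP BY 1, 2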
Backfilling historical data
Already have a month of raw data and want to create aggregations retroactively? Use manual execution with an explicit time range:
curl -X POST http://localhost:8000/api/v1/continuous_queries/1/execute \
-H "Authorization: Bearer $ADMIN_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"start_time": "2026-03-01T00:00:00Z",
"end_time": "2026-04-01T00:00:00Z",
"dry_run": false
}'
This processes the entire month in one shot. The aggregated data appears in the destination measurement immediately.
Monitoring Execution History
Every CQ execution is logged with full metadata:
curl http://localhost:8000/api/v1/continuous_queries/1/executions \
-H "Authorization: Bearer $TOKEN"{
"executions": [
{
"execution_id": "a1b2c3d4",
"execution_time": "2026-04-15T15:00:02Z",
"status": "completed",
"records_read": 1800000,
"records_written": 500,
"duration_seconds": 2.34,
"start_time": "2026-04-15T14:00:00Z",
"end_time": "2026-04-15T15:00:00Z"
}
]
}
1.8 million raw rows distilled into 500 aggregated rows in 2.3 seconds. That's the kind of reduction ratio that makes dashboards fast and storage bills reasonable.
The Full Data Lifecycle
CQs fit into Arc's broader data management story. Here's the complete lifecycle for a production setup:
Raw data (1s resolution)
→ Ingested via ArrowBuffer → Parquet files
→ CQ runs hourly: raw → aggregated (cpu_hourly)
→ Compaction merges small files into large ones
→ Tiering moves old data to S3 (cold tier)
→ Retention deletes raw data after 7 days
→ Retention deletes aggregated data after 90 days
You can even automate the raw data cleanup with delete_source_after_days on the CQ itself. Set it to 7, and Arc will delete source data older than 7 days after each CQ execution — so you only keep the high-resolution data for a week, while the hourly aggregations stick around for months.
Why This Design
A few decisions worth explaining:
Why SQL templates instead of a DSL? Because DuckDB's SQL is already more expressive than any DSL we could design. Window functions, percentiles, joins, CTEs — you get all of it for free. We just handle the time windowing and storage path resolution.
Why write to ArrowBuffer instead of directly to Parquet? Because ArrowBuffer handles buffering, WAL durability, and flush scheduling. CQ results go through the same pipeline as regular ingestion, which means they get compacted, tiered, and managed like any other data.
Why SQLite for CQ state? CQ definitions are small, slow-changing metadata. SQLite's ACID transactions give us crash-safe state tracking without the complexity of a distributed system. The last_processed_time update is a single row update in a local file — microseconds, not milliseconds.
Why a minimum 10-second interval? Because someone will set it to 100ms and wonder why their CPU is at 100%. The minimum prevents accidental self-inflicted denial of service while still allowing sub-minute aggregations for latency-sensitive use cases.
Continuous queries are one of those features that seem simple on the surface — run a query on a schedule, write the results — but the details matter. Time window tracking, crash safety, schema evolution, CTE handling, dry runs. Get any of those wrong and you're either losing data or duplicating it.
We got them right. Give it a try.
Ready to handle billion-record workloads?
Deploy Arc in minutes. Own your data in Parquet. Use for analytics, observability, AI, IoT, or data warehousing.