
How Parquet Compaction Works in Arc (and Why It Runs in a Subprocess)

Tags: Arc, Compaction, Parquet, DuckDB, Engineering, Storage, Performance, Query Optimization, Apache Arrow, Analytical Database

Every analytical database has a dirty secret: ingestion and query performance are at odds. Fast ingestion wants small, frequent writes. Fast queries want large, well-organized files. Arc's compaction system is the bridge between those two worlds.

Let me walk you through how it actually works.

The Problem: Death by a Thousand Small Files

Arc flushes in-memory buffers to Parquet files roughly every 5 seconds (configurable via max_buffer_age_ms). That's great for durability — your data hits disk fast — but it means a busy measurement can produce hundreds of tiny files per hour.

Here's what a single hour of ingestion looks like on disk:

metrics/cpu/2026/04/15/14/cpu_20260415_140005_123456789.parquet   (42 KB)
metrics/cpu/2026/04/15/14/cpu_20260415_140010_987654321.parquet   (38 KB)
metrics/cpu/2026/04/15/14/cpu_20260415_140015_456789123.parquet   (41 KB)
... (100+ more files)

Each file is valid Parquet with proper metadata, compression, and schema — but when you query SELECT * FROM metrics.cpu WHERE time > now() - INTERVAL '24 hours', DuckDB has to open every single one, read its metadata, and scan it. That's a lot of filesystem overhead for what should be a fast read.

Compaction fixes this.

Two Tiers: Hourly and Daily

We use a two-tier compaction model. It's not clever for the sake of being clever — it's the natural shape of the problem.

Hourly compaction runs every hour (default: at :05 past the hour). It takes all the small files from a single hour partition and merges them into one:

# Before (100+ files, ~4 MB total)
metrics/cpu/2026/04/15/14/cpu_20260415_140005_*.parquet
metrics/cpu/2026/04/15/14/cpu_20260415_140010_*.parquet
...

# After (1 file, ~4 MB, properly sorted)
metrics/cpu/2026/04/15/14/cpu_20260415_140500_compacted.parquet

Daily compaction runs once per day (default: 3 AM). It takes the hourly-compacted files from an entire day and merges them into a single large file:

# Before (24 hourly files, ~100 MB total)
metrics/cpu/2026/04/15/00/cpu_*_compacted.parquet
metrics/cpu/2026/04/15/01/cpu_*_compacted.parquet
...
metrics/cpu/2026/04/15/23/cpu_*_compacted.parquet

# After (1 file, ~100 MB, one per day)
metrics/cpu/2026/04/15/cpu_20260416_030000_daily.parquet

The result: recent data lives in a few hourly files (good for time-range queries on today's data), while historical data is consolidated into daily files (good for analytical queries spanning weeks or months).

What a Compaction Job Actually Does

Each job runs a 5-phase pipeline. Here's the condensed version:

1. Download — Parallel download (4 workers) from storage to a local temp directory. This works whether your data lives on local disk, S3, or Azure Blob.

2. Validate — Quick Parquet magic byte check (reads 8 bytes total: PAR1 at head and tail). We don't use DuckDB's read_parquet() for validation because that loads the entire file into memory — fine for one file, catastrophic for hundreds.
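The magic-byte check is simple enough to sketch in a few lines. This is an illustrative stand-in, not Arc's actual code — a valid Parquet file begins and ends with the 4-byte marker PAR1:

```python
import os

def looks_like_parquet(path: str) -> bool:
    """Cheap validity check: read only 8 bytes, 4 at the head and 4 at the tail."""
    try:
        with open(path, "rb") as f:
            if f.read(4) != b"PAR1":
                return False
            f.seek(-4, os.SEEK_END)  # jump to the last 4 bytes
            return f.read(4) == b"PAR1"
    except OSError:
        return False
```

This catches truncated uploads and non-Parquet junk without pulling any row data into memory.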

3. Merge — This is where DuckDB does the heavy lifting:

COPY (
  SELECT * FROM read_parquet(
    ['file1.parquet', 'file2.parquet', ...],
    union_by_name=true
  )
  ORDER BY time
) TO 'output_compacted.parquet' (
  FORMAT PARQUET,
  COMPRESSION ZSTD,
  COMPRESSION_LEVEL 3,
  ROW_GROUP_SIZE 122880
)

A few things worth noting here:

  • union_by_name=true handles schema evolution. If your measurement gained a new column last Tuesday, the older files just get NULL for that column. No migration needed.
  • ORDER BY time (or your custom sort keys) ensures the output is sorted for optimal query-time filtering.
  • ROW_GROUP_SIZE 122880 (~120K rows per group) balances DuckDB's internal read granularity with memory usage.
  • ZSTD at level 3 gives us excellent compression ratios without burning CPU.
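As a sketch, the merge statement above can be rendered from a file list and sort keys. The helper below is hypothetical (Arc's real job runner also wires in dedup handling), but it shows how the knobs fit together:

```python
def build_merge_sql(inputs: list[str], output: str, sort_keys: str = "time") -> str:
    """Render the DuckDB COPY ... read_parquet merge statement for one batch."""
    file_list = ", ".join(f"'{p}'" for p in inputs)
    return (
        f"COPY (SELECT * FROM read_parquet([{file_list}], union_by_name=true) "
        f"ORDER BY {sort_keys}) "
        f"TO '{output}' (FORMAT PARQUET, COMPRESSION ZSTD, "
        f"COMPRESSION_LEVEL 3, ROW_GROUP_SIZE 122880)"
    )
```

Passing a custom sort_keys string like "host,time" is all it takes to change the physical sort order of the compacted output.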

4. Upload — Stream the compacted file back to storage.

5. Delete — Remove the original files. Only files that passed validation in step 2 are deleted — we never delete what we didn't successfully compact.

The Subprocess Trick

Here's something that took us a while to figure out: DuckDB uses jemalloc as its memory allocator. Jemalloc is fast and great at reducing fragmentation, but it has a quirk — it doesn't fully return memory to the OS when you close a connection. The arenas stay allocated for reuse.

In a long-running API server, this means each compaction cycle leaves behind a memory footprint. After 50 compactions, your process is sitting on gigabytes of allocated-but-unused memory. In a container with memory limits, that's an OOM kill waiting to happen.

Our fix: each compaction job runs in a subprocess.

arc (main process) → spawns → arc compact --job-stdin (subprocess)
                                  ↓
                            fresh DuckDB connection
                            runs 5-phase pipeline
                            exits → all memory returned to OS

The parent process serializes the job config as JSON to the child's stdin; the subprocess runs the compaction, writes the result to stdout, and exits. When the process exits, the OS reclaims everything — jemalloc arenas, DuckDB buffers, Arrow arrays, all of it.

This also means an OOM kill on a compaction job doesn't crash your API server. The parent just logs the failure and moves on to the next partition.
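The parent/child handshake looks roughly like this. In the sketch below the child is a stand-in python -c one-liner that just echoes a result; in Arc the spawned command would be the arc compact --job-stdin subprocess:

```python
import json
import subprocess
import sys

def run_compaction_job(job: dict) -> dict:
    """Serialize the job as JSON to the child's stdin; read the result from stdout."""
    # Stand-in child process: parses the job and reports success.
    child = (
        "import json, sys; "
        "job = json.load(sys.stdin); "
        "print(json.dumps({'job_id': job['job_id'], 'status': 'ok'}))"
    )
    proc = subprocess.run(
        [sys.executable, "-c", child],
        input=json.dumps(job),
        capture_output=True, text=True, timeout=60,
    )
    if proc.returncode != 0:
        # An OOM-killed child doesn't take the parent down: log and move on.
        return {"job_id": job["job_id"], "status": "failed"}
    return json.loads(proc.stdout)
```

The key property: every byte the child allocates — jemalloc arenas included — is returned to the OS the moment it exits.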

Auto-Dedup: Free Duplicate Removal

When you ingest data with tag columns (like host, region, device_id), Arc stores that metadata in the Parquet file footer. During compaction, if tag metadata is present, we automatically deduplicate:

SELECT * EXCLUDE (__dedup_rn) FROM (
  SELECT *, ROW_NUMBER() OVER (
    PARTITION BY "host", "region", "time"
    ORDER BY "time" DESC
  ) AS __dedup_rn
  FROM read_parquet([...], union_by_name=true)
) WHERE __dedup_rn = 1
ORDER BY time

Last-write-wins semantics. If the same (host, region, time) tuple appears in multiple files (from retries, late-arriving data, or overlapping ingestion windows), only the newest row survives. This happens silently — no config needed.
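The last-write-wins rule is easy to state in plain Python. This is only a model of the semantics (the real work happens in the ROW_NUMBER() query above), assuming rows are iterated in write order:

```python
def dedup_last_write_wins(rows, key_cols=("host", "region", "time")):
    """Keep only the last row seen for each (tags, time) tuple."""
    latest = {}
    for row in rows:
        # Later writes overwrite earlier ones for the same key.
        latest[tuple(row[c] for c in key_cols)] = row
    return list(latest.values())
```

A retried ingestion that re-sends a point simply replaces the earlier copy instead of doubling it.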

Crash Recovery via Manifests

What happens if Arc crashes mid-compaction? We write a manifest file before uploading the compacted output:

_compaction_state/hourly/metrics/cpu_2026_04_15_14_{jobID}.json

The manifest records which input files are being compacted and what the output path will be. If the process crashes:

  • Before upload: Manifest exists but output doesn't. Next cycle ignores the manifest, retries from scratch.
  • After upload, before delete: Manifest + output both exist. Next cycle verifies the output, completes the deletion of input files.
  • After delete: Manifest is cleaned up. Done.

This makes compaction idempotent. You can crash at any point and restart without duplicating or losing data.
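The recovery decision table above reduces to two booleans. A minimal sketch (hypothetical function name, not Arc's internals):

```python
def recovery_action(manifest_exists: bool, output_exists: bool) -> str:
    """Decide what the next compaction cycle does after a crash."""
    if not manifest_exists:
        return "nothing_to_do"            # clean state, or job fully completed
    if not output_exists:
        return "retry_from_scratch"       # crashed before upload
    return "verify_then_delete_inputs"    # crashed between upload and delete
```

Because every branch either retries the whole job or finishes the deletion, replaying it any number of times converges on the same end state — which is exactly what idempotence means here.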

Configuration

Here's a practical config with the most useful knobs:

compaction:
  hourly_schedule: "5 * * * *"       # Every hour at :05
  daily_schedule: "0 3 * * *"        # 3 AM daily
  hourly_min_files: 10               # Don't compact until 10+ files
  daily_min_files: 12                # Need ~half a day of hourly files
  hourly_min_age_hours: 1            # Wait for ingestion to finish
  daily_min_age_hours: 24            # Wait for the full day
  max_concurrent: 2                  # Parallel jobs (CPU-bound)
  daily_skip_file_age_check_days: 7  # Skip freshness check for old data
 
ingest:
  sort_keys:
    - "cpu:host,time"                # Sort cpu data by host, then time
    - "network:interface,time"       # Sort network data by interface, then time
  default_sort_keys: "time"          # Everything else: just sort by time

API: Manual Triggers and Monitoring

You don't have to wait for the scheduler. Trigger a compaction manually:

curl -X POST http://localhost:8000/api/v1/compaction/trigger \
  -H "Authorization: Bearer $TOKEN" \
  -d 'tier=hourly&database=metrics'

Check what's happening:

# Current status
curl http://localhost:8000/api/v1/compaction/status
 
# Historical stats
curl http://localhost:8000/api/v1/compaction/stats
 
# What's eligible for compaction right now
curl http://localhost:8000/api/v1/compaction/candidates

The /candidates endpoint is particularly useful — it scans storage and shows you which partitions have enough files to compact, so you can decide whether to trigger manually or wait for the scheduler.

Why This Matters

Without compaction, a measurement ingesting 10,000 records/second would create ~720 files per hour. After a week, that's over 120,000 files per measurement. Queries slow to a crawl, S3 LIST calls become expensive, and your cloud bill starts looking grim.

With two-tier compaction, that same week ends up as 7 daily files. Query performance stays flat regardless of how long you've been ingesting.
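The arithmetic is easy to check. A back-of-the-envelope sketch, assuming the default flush roughly every 5 seconds:

```python
FLUSH_INTERVAL_S = 5  # default max_buffer_age_ms, expressed in seconds

def files_after_one_week(layout: str) -> int:
    """Files one measurement accumulates over 7 days under each layout."""
    hours, days = 7 * 24, 7
    return {
        "raw": hours * (3600 // FLUSH_INTERVAL_S),  # one file per flush
        "hourly": hours,                            # one per hour partition
        "daily": days,                              # one per day
    }[layout]

# raw: 120,960 files · hourly: 168 files · daily: 7 files
```

That 120,960 → 7 reduction is why query latency stays flat no matter how long the measurement has been ingesting.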

The subprocess model means compaction never impacts your query latency. The manifest system means it never loses data. And auto-dedup means you don't have to worry about retried ingestion creating duplicates.

It's not the most glamorous part of a database, but it's the part that keeps everything working at scale.
