
Retention Policies in Arc: How We Delete Data Without Rewriting Files

Tags: Arc, Retention, Data Lifecycle, Parquet, Engineering, storage management, TTL, data expiration, IoT, observability, analytical database

Data accumulates. That's what databases do. The problem is, most of the data you ingest today will be irrelevant in 30 days. Maybe 90. Maybe a week, if it's high-frequency sensor telemetry. Without retention policies, your storage bill grows linearly forever and your queries slow down, scanning data nobody will ever look at again.

Arc's retention system handles this, and it does it differently from most databases. Let me explain how — and why.

The Core Idea: Delete Files, Not Rows

Traditional databases delete rows. They scan a table, mark rows as deleted, and eventually vacuum or compact the remaining data. It's expensive, it holds locks, and it generates write amplification.

Arc doesn't do that. Arc stores data in immutable Parquet files, partitioned by hour:

metrics/cpu/2026/04/15/00/cpu_20260415_000500_compacted.parquet
metrics/cpu/2026/04/15/01/cpu_20260415_010500_compacted.parquet
...
metrics/cpu/2026/04/15/23/cpu_20260415_230500_compacted.parquet

Each file contains data for a specific hour. When a retention policy says "delete data older than 30 days," Arc doesn't scan rows — it checks whether the newest timestamp in each file is older than the cutoff. If it is, the entire file gets deleted.

That's it. No rewriting, no vacuuming, no row-level scanning. DELETE is a metadata operation that maps to os.Remove() or an S3 DeleteObject call.

This works because Arc's hourly partitioning creates natural retention boundaries. Once an hour's worth of data ages past your retention window, the whole file goes. Clean, fast, and predictable.
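
To make that concrete, here's a minimal Go sketch of the idea (hypothetical code, not Arc's implementation). It walks the hourly partitions, treats the end of each hour directory as the newest possible timestamp a file could contain, and removes any file that falls entirely before the cutoff; Arc itself checks the actual max timestamp inside each Parquet file.

// Hypothetical sketch of file-level retention, not Arc's actual code.
// The newest possible timestamp is derived from the hourly partition path
// (.../YYYY/MM/DD/HH/...); Arc reads the real max timestamp from the file.
package retention

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
	"strings"
	"time"
)

// partitionEnd returns the end of the hour encoded in the file's directory.
func partitionEnd(path string) (time.Time, error) {
	parts := strings.Split(filepath.ToSlash(filepath.Dir(path)), "/")
	if len(parts) < 4 {
		return time.Time{}, fmt.Errorf("unexpected layout: %s", path)
	}
	hourStart, err := time.Parse("2006/01/02/15", strings.Join(parts[len(parts)-4:], "/"))
	if err != nil {
		return time.Time{}, err
	}
	return hourStart.Add(time.Hour), nil
}

// applyRetention deletes every Parquet file whose partition ends before cutoff.
func applyRetention(root string, cutoff time.Time) error {
	return filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil || d.IsDir() || filepath.Ext(path) != ".parquet" {
			return err
		}
		end, perr := partitionEnd(path)
		if perr != nil {
			return perr
		}
		if end.Before(cutoff) {
			return os.Remove(path) // one delete per expired file, no row scan
		}
		return nil
	})
}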

Buffer Days: Grace Period for Reality

IoT devices lose connectivity. Network partitions delay data. Edge gateways batch and forward hours later. Late-arriving data is a fact of life, not an edge case.

That's why every retention policy has a buffer_days setting (default: 7 days). The actual cutoff date is:

cutoff = now - retention_days - buffer_days

So a 30-day retention policy with 7-day buffer won't delete data until it's 37 days old. The buffer ensures late-arriving data that lands in an "old" partition doesn't get immediately nuked.

You can tune this per policy. Real-time monitoring data with reliable delivery? Set buffer_days: 1. Industrial sensors behind flaky satellite links? Maybe buffer_days: 14.
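
The cutoff math itself is tiny. A hypothetical sketch (field and type names are illustrative, not Arc's actual code):

// Hypothetical sketch of the cutoff computation; names are illustrative.
package retention

import "time"

type RetentionPolicy struct {
	RetentionDays int
	BufferDays    int // grace period for late-arriving data (default 7)
}

// Cutoff returns the point in time before which files may be deleted.
func (p RetentionPolicy) Cutoff(now time.Time) time.Time {
	return now.AddDate(0, 0, -(p.RetentionDays + p.BufferDays))
}

// Example: RetentionPolicy{RetentionDays: 30, BufferDays: 7}.Cutoff(now)
// keeps everything newer than 37 days.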

Creating a Retention Policy

Policies are managed via REST API. Here's how you create one:

curl -X POST http://localhost:8000/api/v1/retention/ \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "metrics-30day",
    "database": "production",
    "retention_days": 30,
    "buffer_days": 7,
    "is_active": true
  }'

This policy applies to all measurements in the production database. If you want to target a specific measurement (because maybe your audit_log measurement needs 365-day retention while cpu only needs 30):

curl -X POST http://localhost:8000/api/v1/retention/ \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "audit-1year",
    "database": "production",
    "measurement": "audit_log",
    "retention_days": 365,
    "buffer_days": 7,
    "is_active": true
  }'

When measurement is null, Arc discovers all measurements in the database by scanning the storage backend. When it's specified, only that measurement is processed.
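
Discovery can be as simple as listing the prefixes under the database path. A hypothetical sketch for a local-filesystem backend (not Arc's actual code):

// Hypothetical sketch: discover measurements by listing directories under
// the database path. An object-store backend would list key prefixes instead.
package retention

import "os"

func discoverMeasurements(databasePath string) ([]string, error) {
	entries, err := os.ReadDir(databasePath)
	if err != nil {
		return nil, err
	}
	var measurements []string
	for _, e := range entries {
		if e.IsDir() { // e.g. "cpu", "memory", "audit_log"
			measurements = append(measurements, e.Name())
		}
	}
	return measurements, nil
}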

Dry Run: Look Before You Leap

Deleting data is permanent. We built dry run mode so you can preview exactly what a policy execution would do without touching anything:

curl -X POST http://localhost:8000/api/v1/retention/1/execute \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"dry_run": true}'

Response:

{
  "policy_name": "metrics-30day",
  "dry_run": true,
  "deleted_count": 45000000,
  "files_deleted": 720,
  "cutoff_date": "2026-03-10T00:00:00Z",
  "affected_measurements": ["cpu", "memory", "disk", "network"],
  "execution_time_ms": 1250.3
}

45 million rows across 720 files. That's what would have been deleted. If the numbers look right, run it for real:

curl -X POST http://localhost:8000/api/v1/retention/1/execute \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"dry_run": false, "confirm": true}'

The confirm: true flag is intentional friction. We don't want you accidentally deleting production data because you forgot to set dry_run.

Automatic Scheduling

Manual execution is fine for one-off cleanup, but you want this running automatically. Arc uses cron-style scheduling:

scheduler:
  retention_schedule: "0 3 * * *"   # 3 AM daily

The scheduler picks up all active policies, executes them sequentially, and logs the results. Each execution is recorded in a history table:

curl http://localhost:8000/api/v1/retention/1/executions \
  -H "Authorization: Bearer $TOKEN"
{
  "executions": [
    {
      "execution_time": "2026-04-09T03:00:01Z",
      "status": "completed",
      "deleted_count": 1500000,
      "cutoff_date": "2026-03-03T00:00:00Z",
      "execution_duration_ms": 3421
    },
    {
      "execution_time": "2026-04-08T03:00:01Z",
      "status": "completed",
      "deleted_count": 1200000,
      "cutoff_date": "2026-03-02T00:00:00Z",
      "execution_duration_ms": 2890
    }
  ]
}

This gives you a full audit trail of every retention execution — when it ran, how many rows and files were deleted, and how long it took.
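
Conceptually, a scheduled run looks like the sketch below (hypothetical types, not Arc's internals): iterate over the active policies, execute each one in turn, and append the result to the execution history.

// Hypothetical sketch of a scheduled retention run; types are illustrative.
package retention

import (
	"log"
	"time"
)

type Execution struct {
	PolicyName  string
	ExecutedAt  time.Time
	DeletedRows int64
	Status      string
	Duration    time.Duration
}

type PolicyRunner interface {
	Name() string
	Active() bool
	Execute(dryRun bool) (Execution, error)
}

// runScheduled executes every active policy sequentially and records history.
func runScheduled(policies []PolicyRunner, record func(Execution)) {
	for _, p := range policies {
		if !p.Active() {
			continue
		}
		start := time.Now()
		exec, err := p.Execute(false) // a real run, not a dry run
		exec.ExecutedAt = start
		exec.Duration = time.Since(start)
		if err != nil {
			exec.Status = "failed"
			log.Printf("retention policy %s failed: %v", p.Name(), err)
		}
		record(exec) // backs GET /api/v1/retention/:id/executions
	}
}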

Note: Automatic scheduling requires an enterprise license. Policy CRUD and manual execution are available in the community edition.

The Memory Fix: Why We Clear DuckDB's Cache

Here's a fun production story. We had a customer running nightly retention on a dataset with thousands of Parquet files. After each retention run, Arc's memory usage climbed by a few hundred MB and never came back down. After a week of nightly retention, the container was using 12 GB just sitting there.

The root cause: DuckDB caches Parquet metadata and data blocks internally when it reads files via read_parquet(). Retention uses read_parquet() to check the max timestamp in each file. Even though the files get deleted, DuckDB's internal cache still holds references to them.

Our fix: after every retention execution, we explicitly clear DuckDB's caches:

-- Clear the httpfs cache (glob results, file handles, data blocks)
SELECT cache_httpfs_clear_cache();
 
-- Reset parquet metadata cache
SET GLOBAL parquet_metadata_cache = false;
SET GLOBAL parquet_metadata_cache = true;

Then we call debug.FreeOSMemory() to return the freed heap pages to the OS. This is debounced (at most once every 30 seconds via atomic CAS) to prevent GC storms when multiple retention policies complete in rapid succession.

This runs on every execution — including dry runs — because read_parquet() populates the cache regardless of whether files are actually deleted.
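
The debounce itself is only a few lines. A hypothetical sketch of the pattern (not Arc's exact code):

// Hypothetical sketch of the debounced memory release: at most one
// debug.FreeOSMemory() call every 30 seconds, guarded by an atomic CAS.
package retention

import (
	"runtime/debug"
	"sync/atomic"
	"time"
)

var lastFree atomic.Int64 // unix nanos of the last FreeOSMemory call

func freeOSMemoryDebounced() {
	const minInterval = 30 * time.Second
	now := time.Now().UnixNano()
	last := lastFree.Load()
	if now-last < int64(minInterval) {
		return // another execution freed memory recently
	}
	// CAS ensures only one of several concurrent callers pays the cost.
	if lastFree.CompareAndSwap(last, now) {
		debug.FreeOSMemory()
	}
}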

Works Everywhere Your Data Lives

Retention works identically across all Arc storage backends:

  • Local filesystem: Files deleted via os.Remove(), empty directories cleaned up automatically
  • S3 / MinIO: Files deleted via DeleteObject, no empty directory concept (S3 is flat)
  • Azure Blob Storage: Files deleted via blob delete API

The policy doesn't know or care where the data lives. It uses Arc's storage abstraction layer, which means a policy created when your data was on local disk continues to work after you enable tiered storage and your old data moves to S3.
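
A sketch of what such an abstraction might look like (hypothetical interface, not Arc's actual API): retention only needs to list files and delete them, so any backend that can do those two things works.

// Hypothetical sketch of a storage abstraction sufficient for retention.
package retention

import "time"

type FileInfo struct {
	Path    string
	MaxTime time.Time // newest timestamp covered by the file
}

type Backend interface {
	List(prefix string) ([]FileInfo, error) // dir walk, S3 ListObjectsV2, Azure list blobs
	Delete(path string) error               // os.Remove, S3 DeleteObject, Azure blob delete
}

// deleteExpired removes every file that ends before the cutoff, wherever it lives.
func deleteExpired(b Backend, prefix string, cutoff time.Time) (int, error) {
	files, err := b.List(prefix)
	if err != nil {
		return 0, err
	}
	deleted := 0
	for _, f := range files {
		if f.MaxTime.Before(cutoff) {
			if err := b.Delete(f.Path); err != nil {
				return deleted, err
			}
			deleted++
		}
	}
	return deleted, nil
}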

Retention + Tiered Storage: The Full Lifecycle

If you're using Arc's tiered storage (hot tier on local disk, cold tier on S3/Azure), the data lifecycle looks like this:

Ingest → Hot Tier (local, fast) → Cold Tier (S3, cheap) → Deleted (retention)
│        │                        │                       │
│        └───── max_age_days ─────┘                       │
└───────────── retention_days + buffer_days ──────────────┘

  1. Data lands on the hot tier (local SSD)
  2. After max_age_days (e.g., 7 days), it's moved to the cold tier (S3)
  3. After retention_days + buffer_days (e.g., 37 days), it's deleted entirely

Retention processes files wherever they currently live. If a file was tiered to S3 last week and is now past its retention date, it gets deleted from S3.

Full API Reference

Method   Path                                 Auth    Purpose
GET      /api/v1/retention/                   Any     List all policies
GET      /api/v1/retention/:id                Any     Get specific policy
POST     /api/v1/retention/                   Admin   Create policy
PUT      /api/v1/retention/:id                Admin   Update policy
DELETE   /api/v1/retention/:id                Admin   Delete policy
POST     /api/v1/retention/:id/execute        Admin   Execute (with dry_run option)
GET      /api/v1/retention/:id/executions     Any     Execution history

Practical Config

retention:
  enabled: true
 
scheduler:
  retention_schedule: "0 3 * * *"   # Run daily at 3 AM

That's the minimum. Create your policies via the API, set them to active, and the scheduler handles the rest.

For most deployments, we recommend:

  • 30-day retention for operational metrics (cpu, memory, disk, network)
  • 90-day retention for application metrics (request latency, error rates)
  • 365-day retention for audit logs and compliance data
  • 7-day buffer for everything (covers most late-arriving data scenarios)

Why File-Level Deletion Works

The reason this approach works so well is that it aligns with how Parquet files are structured. Each file is immutable, self-contained, and timestamped. Deleting a file is O(1) — no scanning, no rewriting, no compaction needed afterward.

The trade-off is precision. If a file contains data from 2 AM to 3 AM and your cutoff is 2:30 AM, the whole file survives until all its data ages past the cutoff. In practice, this means retention has hour-level granularity, not row-level. For most use cases — especially at the 30-day-or-longer retention periods where this matters — the difference is irrelevant.

The upside is that retention execution is fast, predictable, and doesn't impact running queries. No locks, no write amplification, no surprise GC pauses. Just file deletes.

That's the kind of trade-off we like.

Ready to handle billion-record workloads?

Deploy Arc in minutes. Own your data in Parquet. Use it for analytics, observability, AI, IoT, or data warehousing.

Get Started ->