How DELETE Operations Work in Arc (And What's Coming in December)

So you've got a time-series database running on Parquet files. You're ingesting millions of rows, queries are blazing fast, and then someone asks: "Hey, can we delete some old data?"
And that's when things get interesting.
The Problem with Deleting from Parquet
Here's the thing about Parquet files - they're immutable. Once you write them, you can't just go in and remove a few rows. It's not like a database where you mark rows as deleted or update an index. With Parquet, you've got basically two options:
- Tombstones - Keep the data but track which rows are "deleted"
- File rewrites - Literally rewrite the file without the deleted rows
Most systems go with tombstones because they're cheaper. But we went with file rewrites. Let me explain why.
Why We Chose File Rewrites
The decision came down to our core philosophy with Arc: optimize for the common case.
Think about what actually happens in a time-series database. You're writing millions of rows per second—metrics from servers, logs from applications, traces from distributed systems. Those writes never stop. Your dashboards are constantly querying this data, alerts are checking thresholds every few seconds, analytics jobs are crunching numbers. This is the heartbeat of your observability stack.
And then there are deletes. Maybe you run retention policies once a day to drop old data. Maybe you get a GDPR request once a week. Maybe you need to clean up some bad test data. Point is: deletes are rare.
The tombstone approach optimizes for those rare deletes. Mark a row as deleted, done. Fast. But now every single write needs to generate and track row IDs or hashes. Every single query has to filter out those tombstones. That's overhead on operations that happen constantly—the stuff that actually matters for your system's performance.
With file rewrites, we flipped it. Deletes are expensive because we're literally reading Parquet files, filtering them through DuckDB, and writing new ones. But writes? Zero overhead. No hashing, no tombstone tracking, just pure write performance. Queries? Zero overhead. The deleted data is physically gone, no filtering needed. Your queries run on clean data, every time.
And here's the thing about making deletes expensive: they're almost always background jobs anyway. Your retention policy runs at 3 AM when nobody's looking. Your GDPR compliance script doesn't need to finish in milliseconds.
We're optimizing for what matters: blazing fast writes and queries. Deletes can afford to be a bit slower.
How It Actually Works
When you hit Arc's DELETE endpoint, here's what happens:
DELETE Flow in Arc

1. Find Affected Files
   ┌──────────────┐
   │ Parquet File │  ← Scan partitions
   │ 100,000 rows │
   └──────────────┘
          ↓
2. Filter & Rewrite
   ┌──────────────┐
   │    DuckDB    │  ← Apply WHERE clause
   │    Filter    │
   └──────────────┘
          ↓
   ┌──────────────┐
   │ New Parquet  │  ← Write filtered data
   │ 95,000 rows  │
   └──────────────┘
          ↓
3. Atomic Swap
   ┌──────────────┐
   │   Replace    │  ← Old file deleted,
   │  old → new   │     5,000 rows gone forever
   └──────────────┘
curl -X POST http://localhost:8000/api/v1/delete \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_API_TOKEN" \
-d '{
"database": "production",
"measurement": "logs",
"where": "time < '\''2025-01-01'\'' AND severity = '\''DEBUG'\''",
"dry_run": false,
"confirm": true
}'
Step 1: Find Affected Files
Arc scans through your Parquet files looking for ones that contain rows matching your WHERE clause. Currently, this means checking every file in the measurement (we'll come back to this in a minute).
Step 2: Filter and Rewrite
For each affected file:
- Read the file into memory using Arrow
- Use DuckDB to evaluate your WHERE clause
- Write a new file with only the rows that DON'T match
- Atomically replace the old file with the new one
If a file ends up with zero rows after filtering? We just delete it entirely.
Step 3: Cleanup
The old data is physically gone. No tombstones, no orphaned rows, no cleanup needed later.
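To make those three steps concrete, here's a minimal Python sketch of the same idea, using DuckDB for the filtering and an atomic rename for the swap. The helper names and exact SQL are ours, not Arc's internals; treat it as an illustration of the approach rather than the real implementation.

import glob
import os

import duckdb


def candidate_files(measurement_dir: str) -> list[str]:
    # Step 1 (today): every Parquet file in the measurement is a candidate.
    # Partition pruning (v25.12.1) will narrow this list using the time filter.
    pattern = os.path.join(measurement_dir, "**", "*.parquet")
    return sorted(glob.glob(pattern, recursive=True))


def rewrite_without_matches(path: str, where_clause: str) -> int:
    """Rewrite one file, dropping rows that match the DELETE predicate.

    Returns the number of rows removed. `where_clause` is assumed to have
    been validated already (see the safety section below).
    """
    con = duckdb.connect()
    total = con.execute(
        f"SELECT count(*) FROM read_parquet('{path}')"
    ).fetchone()[0]
    kept = con.execute(
        f"SELECT count(*) FROM read_parquet('{path}') WHERE NOT ({where_clause})"
    ).fetchone()[0]

    if kept == total:
        return 0              # nothing in this file matches; leave it alone
    if kept == 0:
        os.remove(path)       # every row matched: drop the file entirely
        return total

    # Step 2: write the surviving rows to a temporary file...
    tmp_path = path + ".rewrite"
    con.execute(
        f"COPY (SELECT * FROM read_parquet('{path}') WHERE NOT ({where_clause})) "
        f"TO '{tmp_path}' (FORMAT PARQUET)"
    )
    # Step 3: ...then atomically swap it into place. The old data is gone.
    os.replace(tmp_path, path)
    return total - kept

The key property is the os.replace at the end: queries either see the old file or the new one, never a half-written file.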
The Daily Compaction Challenge
Here's where it gets interesting. Arc runs compaction at two levels:
- Hourly compaction (every 10 minutes) - merges recent small files
- Daily compaction (3 AM UTC) - creates big daily files from hourly data
So you end up with files like:
/2025/11/17/14/cpu_compacted.parquet # Hourly
/2025/11/17/cpu_20251117_daily.parquet # Daily
Our DELETE implementation handles both automatically. When you delete data spanning multiple days, Arc finds and rewrites both the hourly files AND the daily compacted files. The data really is gone from everywhere.
Safety First
Here's the thing about deletes: once you run them, the data is gone. Really gone. No undo button, no rollback, no "oops I didn't mean to delete production data" recovery mode. So we built Arc with the assumption that you're going to make mistakes, and we're going to help you catch them before they matter.
Start with dry-run mode. It's basically a rehearsal for your delete operation—Arc shows you exactly what would get deleted without actually touching anything. You can validate your WHERE clause, see the row counts, check which files would be affected. Think of it as looking both ways before crossing the street.
curl -X POST http://localhost:8000/api/v1/delete \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_API_TOKEN" \
-d '{
"database": "default",
"measurement": "logs",
"where": "host = '\''old-server'\''",
"dry_run": true
}'
When you're ready to actually delete, Arc makes you think twice about big operations. By default, if you're about to delete more than 10,000 rows, you need to explicitly add confirm: true to your request. It's like the system tapping you on the shoulder saying "Hey, this is a lot of data—you sure?" You can tune this threshold via delete.confirmation_threshold to match your comfort level.
There's also a hard ceiling: one million rows per operation by default. You can't accidentally nuke your entire dataset in a single API call. If you need to delete more than that, you'll need to either increase delete.max_rows_per_delete in your config or break it into multiple operations. It's a safety rail, not a limitation.
And about SQL injection—your WHERE clauses go through DuckDB's parser before we touch any files. If the SQL doesn't parse cleanly, the operation fails. No string interpolation vulnerabilities, no sneaky DROP TABLE attempts.
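One way to picture that check is to ask DuckDB to plan the statement without executing it: a clause that doesn't parse, or that references columns that don't exist, fails right there before any file is rewritten. This is a sketch of the idea, not Arc's actual validation code, and the function name is ours:

import duckdb


def validate_where(where_clause: str, sample_parquet: str) -> None:
    # EXPLAIN plans the statement without running it, so a malformed
    # WHERE clause raises here instead of partway through a rewrite.
    duckdb.connect().execute(
        f"EXPLAIN SELECT count(*) FROM read_parquet('{sample_parquet}') "
        f"WHERE {where_clause}"
    )

# validate_where("severity = 'DEBUG'", some_affected_file)  # parses cleanly
# validate_where("severity = ", some_affected_file)         # raises a parser error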
Want to delete an entire measurement? You can, but you have to be explicit about it. No implicit "whoops I forgot the WHERE clause" disasters:
curl -X POST http://localhost:8000/api/v1/delete \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_API_TOKEN" \
-d '{
"database": "default",
"measurement": "logs",
"where": "1=1",
"confirm": true
}'
These aren't arbitrary restrictions. They're bumpers in the bowling alley—keeping you on track while you're moving fast.
What's Coming in December (v25.12.1)
Remember how I said Arc scans all files to find matches? That's about to change.
Partition Pruning for DELETE
We're bringing the same partition pruning optimization we use for queries to DELETE operations.
When your WHERE clause has a time filter:
curl -X POST http://localhost:8000/api/v1/delete \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_API_TOKEN" \
-d '{
"database": "production",
"measurement": "logs",
"where": "time >= '\''2025-11-10'\'' AND time < '\''2025-11-16'\''",
"dry_run": true
}'
Instead of scanning thousands of files, Arc will:
- Extract the time range from your WHERE clause
- Generate partition paths for just those days
- Only scan ~150 files instead of 8,760+
That's a 10-100x speedup on file discovery.
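The core trick is simple: turn the time bounds into a list of partition prefixes and only look there. Here's a rough sketch assuming the /YYYY/MM/DD/HH layout shown earlier; the helper name is ours, and the real implementation will differ:

from datetime import datetime, timedelta


def hourly_partition_prefixes(start: datetime, end: datetime) -> list[str]:
    """Generate /YYYY/MM/DD/HH prefixes covering [start, end)."""
    prefixes = []
    current = start.replace(minute=0, second=0, microsecond=0)
    while current < end:
        prefixes.append(current.strftime("%Y/%m/%d/%H"))
        current += timedelta(hours=1)
    return prefixes


# The six-day range from the example above touches 144 hourly partitions
# (plus a handful of daily compacted files) instead of 8,760+ files.
prefixes = hourly_partition_prefixes(datetime(2025, 11, 10), datetime(2025, 11, 16))
print(len(prefixes))  # 144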
Cloud Storage Support
Currently, DELETE only works with local storage. In v25.12.1, we're adding full support for:
- Amazon S3
- MinIO
- Google Cloud Storage (GCS)
- Ceph Object Storage
The flow for cloud storage:
- List files using partition pruning (only relevant partitions)
- Download affected files
- Filter locally using DuckDB
- Upload the rewritten files back
- Delete the originals
curl -X POST http://localhost:8000/api/v1/delete \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_API_TOKEN" \
-d '{
"database": "production",
"measurement": "metrics",
"where": "time < '\''2025-10-01'\''",
"confirm": true
}'
Same safety guarantees, same atomic operations, just works across all your storage backends.
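If you're curious what that loop could look like for a single object, here's a purely illustrative sketch using boto3 and DuckDB. It isn't Arc's actual v25.12.1 code; it just mirrors the list/download/filter/upload steps above:

import boto3
import duckdb

s3 = boto3.client("s3")


def rewrite_s3_object(bucket: str, key: str, where_clause: str) -> None:
    """Download one Parquet object, filter it locally, and push the rewrite back."""
    local = "/tmp/" + key.replace("/", "_")
    rewritten = local + ".rewrite"

    s3.download_file(bucket, key, local)      # pull the affected file
    duckdb.connect().execute(                 # filter locally with DuckDB
        f"COPY (SELECT * FROM read_parquet('{local}') WHERE NOT ({where_clause})) "
        f"TO '{rewritten}' (FORMAT PARQUET)"
    )
    s3.upload_file(rewritten, bucket, key)    # overwrite the original object
    # Writing to a new key and then calling s3.delete_object(Bucket=bucket, Key=key)
    # would mirror the "upload, then delete the originals" flow described above.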
Performance: The Reality
Let's be honest about the trade-offs. Right now, without partition pruning, DELETE has to be a bit thorough—maybe too thorough. If you've got a year's worth of data, that's potentially 8,760+ hourly files sitting in your storage. When you run a DELETE, Arc scans through all of them looking for matches. On local storage, that's about 30 seconds of file discovery. On cloud storage? You're looking at 2 minutes just to figure out which files to touch. Then comes the actual rewrite work, which can take minutes to hours depending on how much data you're deleting.
It's not fast. We know it's not fast. But remember: this is a background operation. Your retention policy doesn't care if it takes 5 minutes or 50 minutes to run at 3 AM.
When v25.12.1 lands in December with partition pruning, the math changes completely. Let's say you're deleting a week of data. Instead of scanning 8,760 files, Arc will look at just the partitions for those seven days—about 175 files (7 days × 24 hours + 7 daily compacted files). File discovery drops from 30 seconds to half a second on local storage. On cloud? Five seconds instead of two minutes. The entire DELETE operation goes from "grab lunch" territory to "grab coffee" territory.
The file rewriting itself is still I/O bound—we're literally reading Parquet files, filtering them through DuckDB, and writing new ones. That part doesn't get faster with partition pruning. But finding which files to rewrite? That's about to get 10-100x faster, and that matters when you're running retention policies across terabytes of data.
When to Use DELETE
DELETE shines when you need to actually remove data and you're not in a hurry. Data retention policies are the classic example—once a day at 3 AM, drop everything older than 90 days. Nobody's watching, nobody's waiting for a response in milliseconds. Same with compliance requirements: when you get a GDPR request to remove a user's data, you need it gone, and you can afford to spend a few minutes making sure it's really deleted from every file.
Test data cleanup is another sweet spot. You've been running experiments in your staging environment, ingesting synthetic loads, and now you need to clear it out before the next test run. Fire off a DELETE, grab some coffee, come back to a clean slate.
Or maybe you ingested bad data during a deployment—wrong timestamps, malformed metrics, whatever. If you know the time range when things went sideways, DELETE lets you surgically remove just those rows and keep everything else intact.
But DELETE isn't the right tool for everything. If you're thinking about deleting data frequently—like, multiple times per hour—the overhead is going to hurt. Remember, we're rewriting entire Parquet files. Do that too often and you're spending more time on file I/O than actually running your database.
Row-level updates? Don't delete and re-insert. Just write the new data with an updated timestamp. Time-series databases are append-only by nature—embrace it. Your queries can use the latest value.
And if you need soft deletes—marking something as inactive but keeping it around—just add a status column. Way more efficient than running DELETE operations that don't actually need to remove the data.
Try It Out
If you're running Arc with local storage, DELETE is available now. Just make sure you enable it in your config:
[delete]
enabled = true
confirmation_threshold = 10000 # Require confirm for >10k rows
max_rows_per_delete = 1000000    # Hard limit
Start with dry-run mode to get a feel for it:
curl -X POST http://localhost:8000/api/v1/delete \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_API_TOKEN" \
-d '{
"database": "default",
"measurement": "logs",
"where": "time < '\''2025-01-01'\''",
"dry_run": true
}'
And in December, when v25.12.1 drops, you'll get partition pruning and cloud storage support automatically. No config changes needed.
The Bottom Line
We built DELETE the way we build everything in Arc - optimized for the real-world use case. Writes and queries need to be fast. Deletes can be a bit slower because they're rare and usually run as background jobs anyway.
The file rewrite approach keeps our architecture simple, our query performance predictable, and our data actually deleted when you delete it. No tombstone cleanup, no surprise storage growth, no query performance degradation over time.
And with partition pruning coming in December, even the "slow" part (file discovery) is about to get 10-100x faster.
That's the Arc way: make the common case fast, keep it simple, and actually delete your data when you ask us to.
Want to learn more? Check out the DELETE docs at https://github.com/Basekick-Labs/arc/blob/main/docs/DELETE.md or join the discussion on the Arc repo at https://github.com/Basekick-Labs/arc.
Running Arc? We'd love to hear how you're using DELETE operations. Hit me up on Twitter or open an issue with your use case.