How Deduplication Works in Arc


If you re-run a backfill, replay an MQTT topic, retry a failed write after a flaky network link on the plant floor, or have a gateway re-publish a buffered batch when connectivity returns, you will eventually write the same sensor reading twice. Most ingestion pipelines have to decide what to do about that on the write path, and pay for it on every single write.

Arc takes a different approach: ingestion is append-only, and deduplication happens later, during compaction, for data that carries a dedup key. This post explains exactly what counts as a duplicate, which writes get de-duplicated (and which don't), what happens at each stage, and why the work is deferred.

TL;DR

  • Ingestion is append-only. Arc writes every point you send, as-is. Writing the same point twice produces two rows on disk. There is no unique constraint, no upsert, no write-time dedup. That's what keeps ingestion fast.
  • Deduplication happens at compaction. When Arc compacts a partition's small Parquet files into a larger one, it collapses duplicates in the same pass.
  • A duplicate is a row with the same series and the exact same timestamp. "Series" means the measurement plus the full set of dedup-key column values. Two points only collapse when every key column matches and the timestamp is identical to the microsecond.
  • Everything is stored as plain columns. Arc has no internal tag/field type split; "dedup key" is just a set of column names recorded in the Parquet metadata.
  • Dedup only runs when the data carries a dedup key. Arc records the key column names in the Parquet file's metadata at ingest. Today that happens on the Line Protocol (InfluxDB-compatibility) path. MessagePack columnar writes, Arc's primary format, carry no key, so they are compacted but not de-duplicated. Plan your writes accordingly.
  • COUNT(*) can drop after compaction. Duplicates are real rows until compaction removes them, so count- and sum-style aggregates over a freshly-ingested partition can shrink once it's compacted. The post-compaction number is the correct one.

The data model: columns, plus a "dedup key"

Arc stores everything as plain columns in Parquet. There is no internal "tag type" versus "field type": a measurement is a table, and every value you write, dimension or metric, is just a column alongside a time column. That uniform model is what lets Arc query the files with standard SQL and hand them to any Parquet-aware tool.

So where does deduplication get a key from? From column names that Arc has marked as the dedup key in the Parquet file's metadata. How those names get marked depends on how you write:

  • MessagePack columnar (Arc's primary, highest-throughput format). You send columns directly. Arc does not mark any of them as a dedup key, so MessagePack-ingested data is compacted but not de-duplicated. (More on this below; it's the most important caveat in this post.)
  • InfluxDB Line Protocol (the compatibility path). If you're coming from InfluxDB and pointing Telegraf or another InfluxDB-compatible tool at Arc, you're writing Line Protocol. That format carries an explicit tag/field split, and Arc preserves the tag column names as the dedup key:
temperature,machine_id=press_07,line=A celsius=72.4,vibration=0.31 1704067200000000
└────┬────┘ └───────────┬────────────┘ └────────────┬────────────┘ └──────┬───────┘
measurement        tag columns                field columns           timestamp

Here machine_id and line become the dedup-key columns; celsius and vibration are ordinary value columns. All four still land as plain columns in the same Parquet file; the tag/field distinction exists only as a hint that says "key dedup on these." The measurement plus a specific set of those key-column values defines a series: temperature for machine_id=press_07,line=A is one series; temperature for machine_id=press_08 is a different one.

A single point is therefore: one series, at one timestamp, carrying some values. That definition is the whole basis for deduplication, for the data that carries a dedup key.
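
To make "one series, at one timestamp" concrete, here is a minimal Python sketch that splits a simple Line Protocol line into its parts and derives the (series, time) identity. It is an illustration only, not Arc's parser: `Point`, `parse_line`, and `dedup_identity` are hypothetical names, and the parser assumes no escaping and numeric fields only.

```python
from typing import NamedTuple


class Point(NamedTuple):
    measurement: str
    tags: dict[str, str]      # dedup-key columns (Line Protocol tags)
    fields: dict[str, float]  # ordinary value columns
    time: int                 # epoch timestamp as sent


def parse_line(line: str) -> Point:
    """Parse one simple Line Protocol line (no escaping, numeric fields only)."""
    head, field_part, ts = line.split(" ")
    measurement, *tag_pairs = head.split(",")
    tags = dict(pair.split("=", 1) for pair in tag_pairs)
    fields = {k: float(v) for k, v in
              (pair.split("=", 1) for pair in field_part.split(","))}
    return Point(measurement, tags, fields, int(ts))


def dedup_identity(p: Point) -> tuple:
    """(series, time): measurement + sorted key-column values + timestamp."""
    return (p.measurement, tuple(sorted(p.tags.items())), p.time)


a = parse_line("temperature,machine_id=press_07,line=A celsius=72.4,vibration=0.31 1704067200000000")
b = parse_line("temperature,machine_id=press_07,line=A celsius=72.4,vibration=0.31 1704067200000000")
c = parse_line("temperature,machine_id=press_08,line=A celsius=68.1 1704067200000000")

print(dedup_identity(a) == dedup_identity(b))  # True: same series, same time
print(dedup_identity(a) == dedup_identity(c))  # False: different series
```

Note that the field values play no part in the identity: only the measurement, the key columns, and the timestamp do.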

Arc stores time as a TIMESTAMP WITH TIME ZONE (UTC) column. Everything is anchored to UTC internally; timestamps must be sent as a numeric epoch (seconds, milliseconds, microseconds, or nanoseconds; Arc auto-detects the unit).
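
Unit auto-detection can be pictured as a magnitude check. The sketch below is a guess at the general idea, not Arc's actual rule: the `to_micros` helper and its thresholds are assumptions chosen so that present-day epochs fall into the right bucket, and dropping sub-microsecond digits is likewise an assumption of the sketch.

```python
from datetime import datetime, timezone


def to_micros(epoch: int) -> int:
    """Normalize a numeric epoch to microseconds by guessing its unit.

    The thresholds are illustrative assumptions, not Arc's documented rule.
    """
    magnitude = abs(epoch)
    if magnitude < 10**11:        # treated as seconds
        return epoch * 1_000_000
    if magnitude < 10**14:        # treated as milliseconds
        return epoch * 1_000
    if magnitude < 10**17:        # treated as microseconds
        return epoch
    return epoch // 1_000         # treated as nanoseconds; extra digits dropped


# The same instant sent in four different units.
for raw in (1704067200, 1704067200000, 1704067200000000, 1704067200000000000):
    micros = to_micros(raw)
    as_utc = datetime.fromtimestamp(micros / 1_000_000, tz=timezone.utc)
    print(raw, "->", micros, as_utc.isoformat())
```

All four inputs normalize to 1704067200000000, which is 2024-01-01 00:00:00 UTC.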

At ingestion: append-only, by design

When a write arrives (Line Protocol, MessagePack, MQTT, or a bulk import), Arc parses it into columns and buffers them, then flushes the buffer to a Parquet file in the partition for that hour. That's it. There is no lookup to see whether the point already exists, no primary-key check, and no overwrite.

This is deliberate. A write-time uniqueness check would mean, for every incoming point, consulting an index of everything already written for that series and timestamp, turning a cheap append into a read-modify-write. At Arc's ingest rates (millions of records per second) that index would be the bottleneck. Parquet, the underlying format, has no notion of a primary key anyway; it is an immutable, append-only columnar file.

The practical consequence is simple and worth internalizing:

If you send the same point twice, Arc stores it twice. Both rows are real and both are queryable until the partition is compacted.
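
A toy model of that write path, in Python, shows why the second copy simply lands next to the first. This is a sketch under stated assumptions: `AppendOnlyStore` is a hypothetical in-memory stand-in for the buffer-and-flush path, and the hour bucketing is a simplification of the real partition layout.

```python
from collections import defaultdict

MICROS_PER_HOUR = 3_600 * 1_000_000


class AppendOnlyStore:
    """In-memory stand-in for append-only ingestion into hourly partitions."""

    def __init__(self):
        self.partitions = defaultdict(list)  # hour start (microseconds) -> rows

    def write(self, row: dict) -> None:
        hour = row["time"] - row["time"] % MICROS_PER_HOUR
        self.partitions[hour].append(row)    # no lookup, no key check, no overwrite


store = AppendOnlyStore()
reading = {"machine_id": "press_07", "line": "A",
           "time": 1704067200000000, "celsius": 72.4}
store.write(reading)
store.write(dict(reading))  # a retry of the same point

rows = store.partitions[1704067200000000]
print(len(rows))  # 2: both copies are stored
```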

That is usually fine: duplicate points carry identical values, so a MIN, MAX, or LAST over a short window is unaffected, an AVG shifts only by the extra weight of the repeated reading, and a freshly-ingested hour is typically compacted within the hour. But if you query a not-yet-compacted partition right after a double-write, you may see the duplicate. Compaction is what makes it go away, provided the data was written with a dedup key (see below).

At compaction: where deduplication happens

Arc periodically compacts the many small Parquet files in a partition into one larger file. This keeps queries fast and storage efficient as data accumulates (an hourly partition might collect dozens of small files; daily compaction rolls hours into a day). Deduplication rides along in that same pass, so it costs no extra scan.

When Arc compacts a partition, for each measurement whose files declare a dedup key it builds a query of this shape (simplified):

COPY (
  WITH normalized AS (
    SELECT * FROM read_parquet([...files...], union_by_name=true)
  )
  SELECT * FROM normalized
  QUALIFY ROW_NUMBER() OVER (
    PARTITION BY "machine_id", "line", "time"   -- all dedup-key columns + time
    ORDER BY "time" DESC
  ) = 1
  ORDER BY "time"
) TO 'compacted.parquet' (FORMAT PARQUET, COMPRESSION ZSTD)

The dedup key is the PARTITION BY clause: every key column plus time. Read that as: group rows by (series, timestamp); keep one row per group. Because the partition includes all key columns, two readings only collapse when they belong to the same series and share the exact same timestamp. Two different machines reporting at the same instant are different series, so both survive, correctly; they are distinct measurements of distinct equipment.

A worked example. Suppose a partition contains:

machine_id  line  time              celsius  note
press_07    A     1704067200000000  72.4
press_07    A     1704067200000000  72.4     duplicate: same series + time
press_07    A     1704067260000000  72.6     same series, different time
press_08    A     1704067200000000  68.1     different series, same time

After compaction:

machine_id  line  time              celsius
press_07    A     1704067200000000  72.4
press_07    A     1704067260000000  72.6
press_08    A     1704067200000000  68.1

Only the exact (series, time) duplicate is removed. The different-timestamp row and the different-series row are both kept.
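
The same grouping can be modeled in a few lines of Python, run over the four rows of the worked example. This is a sketch of the semantics only, not Arc's code: `compact` and the row dictionaries are illustrative, and keeping the first row seen in each group is this sketch's choice.

```python
def compact(rows, key_columns):
    """Keep one row per (key columns, time); return the survivors sorted by time."""
    kept = {}
    for row in rows:
        identity = tuple(row.get(col) for col in key_columns) + (row["time"],)
        kept.setdefault(identity, row)  # the first row seen stands in for its group
    return sorted(kept.values(), key=lambda r: r["time"])


rows = [
    {"machine_id": "press_07", "line": "A", "time": 1704067200000000, "celsius": 72.4},
    {"machine_id": "press_07", "line": "A", "time": 1704067200000000, "celsius": 72.4},
    {"machine_id": "press_07", "line": "A", "time": 1704067260000000, "celsius": 72.6},
    {"machine_id": "press_08", "line": "A", "time": 1704067200000000, "celsius": 68.1},
]

compacted = compact(rows, ["machine_id", "line"])
rows_before, rows_after = len(rows), len(compacted)
print(rows_before, "->", rows_after)  # 4 -> 3
```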

Which row wins?

Within a group of true duplicates (same series, same timestamp), Arc keeps one and drops the rest (ROW_NUMBER() ... = 1). Because every row in such a group shares the same timestamp, they are, by definition, the same point; in the common case their field values are identical too, so the choice is immaterial. The ordering (ORDER BY "time" DESC) is not a meaningful "newest wins": every row in a same-timestamp group has the same time value, so the sort cannot separate them, and which row ends up as row number 1 is left to the engine.

If you need true last-write-wins semantics for corrections (re-sending a point with the same series and timestamp but a changed field value), be aware that compaction's choice among same-key rows is arbitrary. The reliable pattern for corrections is to make the corrected point distinguishable (or to rely on your own write ordering) rather than to assume the most recent write is the one retained.
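
One way to find out whether this matters for your data is to look for same-key rows that disagree before relying on compaction. The Python sketch below does that over in-memory rows; `conflicting_groups` is a hypothetical helper, not an Arc feature.

```python
from collections import defaultdict


def conflicting_groups(rows, key_columns):
    """Return (key..., time) identities whose rows disagree on non-key values."""
    groups = defaultdict(set)
    for row in rows:
        identity = tuple(row[col] for col in key_columns) + (row["time"],)
        values = tuple(sorted((k, v) for k, v in row.items()
                              if k not in key_columns and k != "time"))
        groups[identity].add(values)
    return [identity for identity, variants in groups.items() if len(variants) > 1]


rows = [
    # An attempted correction: same series and time, different value.
    {"machine_id": "press_07", "time": 1704067200000000, "celsius": 72.4},
    {"machine_id": "press_07", "time": 1704067200000000, "celsius": 72.9},
    # A harmless exact duplicate.
    {"machine_id": "press_08", "time": 1704067200000000, "celsius": 68.1},
    {"machine_id": "press_08", "time": 1704067200000000, "celsius": 68.1},
]

conflicts = conflicting_groups(rows, ["machine_id"])
print(conflicts)  # only the press_07 group is reported
```

Groups reported here are the ones where compaction would keep one value and silently drop the other.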

When dedup does not run

Deduplication is opt-in by data shape, not a flag. At ingest, Arc records the dedup-key column names in the Parquet file's key/value metadata (arc:tags). At compaction it reads that metadata; if a measurement's files declare key columns, dedup runs on (key columns, time). If a measurement's files carry no key metadata, they are still compacted (the small files are merged for performance), but rows are not de-duplicated, because without a key there is no series to group by. Key metadata is unioned across all files in the partition, so schema evolution (a key column appearing partway through) is handled.
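
The decision can be sketched as a union over per-file metadata. In the Python below, each file's metadata is modeled as a dictionary whose arc:tags entry is already a list of column names; that shape is an assumption for illustration (the on-disk encoding of the value is not covered here), and `dedup_columns` is a hypothetical helper.

```python
def dedup_columns(files_metadata):
    """Union declared key columns across files; None means merge without dedup."""
    keys = []
    for meta in files_metadata:
        for col in meta.get("arc:tags", []):  # assumed shape: list of column names
            if col not in keys:
                keys.append(col)
    return keys + ["time"] if keys else None


# A key column ("line") that appears partway through the partition.
evolving = [{"arc:tags": ["machine_id"]}, {"arc:tags": ["machine_id", "line"]}]
print(dedup_columns(evolving))   # ['machine_id', 'line', 'time']

# Files with no key metadata: merged, but not de-duplicated.
keyless = [{}, {}]
print(dedup_columns(keyless))    # None
```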

In practice there are two cases where a file carries no key, and one is important:

  • MessagePack columnar writes. Arc's primary high-throughput format sends columns directly and does not declare a dedup key, so MessagePack-ingested data is compacted but never de-duplicated. If you ingest via MessagePack and need duplicates collapsed, you have to handle idempotency on the write side (or de-duplicate at query time with DISTINCT/GROUP BY).
  • Line Protocol with no tags. A Line Protocol measurement written with fields only and no tags has no key columns either, so it is compacted but not de-duplicated.
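
For the MessagePack case, write-side idempotency can be as simple as remembering what has already been sent. The sketch below is one possible approach, not an Arc client API: `IdempotentSender` is hypothetical, `send` stands in for whatever function actually ships a row, and the seen-set is process-local and unbounded, so a real version would need to bound it (for example by time window) and decide what happens across restarts.

```python
class IdempotentSender:
    """Drop points already sent, keyed on (measurement, key values, time)."""

    def __init__(self, send, key_columns):
        self._send = send                # hypothetical callable that ships one row
        self._key_columns = key_columns
        self._seen = set()

    def write(self, measurement, row):
        identity = (measurement,
                    tuple(row[col] for col in self._key_columns),
                    row["time"])
        if identity in self._seen:
            return False                 # duplicate: skipped
        self._send(measurement, row)
        self._seen.add(identity)         # recorded only after a successful send
        return True


sent = []
sender = IdempotentSender(lambda m, r: sent.append((m, r)), ["machine_id", "line"])
reading = {"machine_id": "press_07", "line": "A",
           "time": 1704067200000000, "celsius": 72.4}

first = sender.write("temperature", reading)
retry = sender.write("temperature", dict(reading))
print(first, retry, len(sent))  # True False 1
```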

Why your row count can drop after compaction

This is the one behavior that surprises people, so it's worth stating plainly:

COUNT(*) over a partition can return a smaller number after that partition is compacted. That's not data loss; it's the duplicates being removed.

Because ingestion is append-only, a duplicate reading is two physical rows on disk until compaction runs. If you query the partition in that window, COUNT(*) counts both. After compaction collapses the (series, time) duplicate, the same query counts one. Arc logs exactly this when it happens: the compaction job records rows_before and rows_after and reports how many rows were de-duplicated.

A quick illustration, querying the same instant before and after compaction of the press_07 example above (1704067200000000 microseconds is 2024-01-01 00:00:00 UTC):

SELECT count(*) FROM temperature
WHERE machine_id = 'press_07' AND time = TIMESTAMP '2024-01-01 00:00:00Z';
-- before compaction: 2   (the duplicate reading is still on disk)
-- after  compaction: 1   (collapsed to one row)

Which queries are affected, and which are not:

  • Row-multiplicity-sensitive aggregates move. COUNT(*), SUM(celsius), and AVG(celsius) change when duplicates are removed: a duplicated reading was being counted/summed twice. After dedup they reflect the true single reading.
  • Idempotent aggregates do not. MIN, MAX, LAST, FIRST, and COUNT(DISTINCT ...) return the same answer before and after, because a duplicate doesn't change the extremes or the distinct set.
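
The split is easy to check by hand. The Python below applies each aggregate to the press_07 celsius values from the worked example, once with the duplicate still present and once without; it models the arithmetic only and does not query Arc.

```python
from statistics import mean

before = [72.4, 72.4, 72.6]  # duplicate reading still on disk
after = [72.4, 72.6]         # after compaction


def count_distinct(values):
    return len(set(values))


aggregates = [("COUNT", len), ("SUM", sum), ("AVG", mean),
              ("MIN", min), ("MAX", max), ("COUNT DISTINCT", count_distinct)]

for name, fn in aggregates:
    changed = fn(before) != fn(after)
    print(f"{name}: {'changes' if changed else 'unchanged'}")
```

COUNT, SUM, and AVG come out as "changes"; MIN, MAX, and COUNT DISTINCT come out as "unchanged".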

The post-compaction number is the correct one: the duplicate readings were never distinct measurements, so removing them gives you the count you actually meant. The takeaway for dashboards and alerts is simply: don't build logic that depends on the transient pre-compaction count of a freshly-ingested partition (for example, "alert if we received more than N readings this minute" computed seconds after write). If you need a stable, exact count immediately, query a window old enough to have been compacted, or count distinct (key columns, time) so the answer is dedup-invariant.
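
Counting distinct (key columns, time) gives the same answer whether or not compaction has run yet. A minimal Python sketch of that idea, with a hypothetical `stable_count` helper operating on in-memory rows:

```python
def stable_count(rows, key_columns):
    """Count distinct (key columns, time) tuples; unaffected by duplicates."""
    return len({tuple(row.get(col) for col in key_columns) + (row["time"],)
                for row in rows})


uncompacted = [
    {"machine_id": "press_07", "line": "A", "time": 1704067200000000, "celsius": 72.4},
    {"machine_id": "press_07", "line": "A", "time": 1704067200000000, "celsius": 72.4},
    {"machine_id": "press_07", "line": "A", "time": 1704067260000000, "celsius": 72.6},
]
compacted = [uncompacted[0], uncompacted[2]]

print(len(uncompacted), len(compacted))            # 3 2: raw counts differ
print(stable_count(uncompacted, ["machine_id", "line"]),
      stable_count(compacted, ["machine_id", "line"]))  # 2 2: distinct counts agree
```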

Why defer dedup to compaction?

Three reasons, all about keeping the hot path hot:

  1. Ingestion stays append-only and fast. No per-write index lookup, no read-modify-write. A write is a parse-and-append.
  2. Dedup is free-riding on work you already do. Compaction has to read and rewrite the partition's files regardless; collapsing duplicates in the same pass adds negligible cost and avoids a second scan.
  3. It matches how duplicates actually occur. Duplicates come in bursts (a retried batch, a replayed stream, a re-run backfill) landing close together in the same partition, exactly where the next compaction will catch them. Paying a uniqueness tax on every write to handle a bursty, occasional problem is the wrong trade for an analytical store.

The trade-off is eventual rather than immediate consistency for duplicates: there is a window, between a double-write and the partition's compaction, where both copies are visible. For Arc's workloads (analytics, observability, IoT telemetry, agent memory) that window is short and the duplicate values are typically identical, so it's rarely observable. In exchange you get ingestion that doesn't slow down as your dataset grows.

Practical guidance

  • Know whether your write path de-duplicates. If you ingest via Line Protocol with tags (the InfluxDB-compatibility path used by Telegraf and similar tools), re-sending a point is safe; compaction resolves it, and retries and at-least-once delivery are fine. If you ingest via MessagePack, there is no dedup: handle idempotency on the write side, or de-duplicate at query time.
  • For Line Protocol, make sure your data is tagged. The tag columns are the dedup key. A measurement written with fields only and no tags won't be de-duplicated.
  • Use exact, consistent timestamps for points you consider "the same." Dedup is microsecond-exact: 1704067200000000 and 1704067200000001 are different points and both will be kept.
  • Don't rely on compaction for field-level corrections. Same (series, time) with different field values collapses to one arbitrarily-chosen row; that's dedup, not an upsert.
  • Expect counts to settle after compaction. Don't build alerts or dashboards on the exact COUNT(*)/SUM(...) of a partition that may still be holding duplicates. For a stable count immediately, count distinct (key columns, time); that's dedup-invariant.

Summary

Arc separates writing from cleaning up. Ingestion is append-only so it stays fast, and deduplication happens during compaction, keyed on the exact (series, time) pair, that is, the full set of dedup-key column values plus a microsecond-precise timestamp. Same series, same instant, collapses to one row; anything that differs in a key column or in the timestamp is preserved. The one caveat to keep front of mind: that key is recorded today only on the Line Protocol ingest path, so MessagePack writes are compacted but not de-duplicated. It's a design that pays the dedup cost where it's cheap, on the work you were already going to do, instead of taxing every write.
