
Arc on ClickBench: CrateDB. 50x on Ingestion, Not Close on Analytics.


CrateDB has an interesting position in the market. It's a distributed SQL database built on top of Lucene — think Elasticsearch with a proper SQL layer on top. It targets IoT, observability, and machine data at scale. Teams that want SQL semantics without giving up the horizontal scaling story of the Elastic stack often land on CrateDB.

So we ran Arc against it.

What CrateDB Is

CrateDB is built on Lucene, the same index engine that powers Elasticsearch. It adds a SQL layer, a planner, and distributed query execution on top. The storage model is still inverted indexes and doc values — not columnar Parquet files, not vectorized execution.

It supports standard SQL including GROUP BY, aggregations, and time-based queries. The ingestion path is HTTP JSON via /_sql with bulk inserts — there's no binary protocol, no COPY-style interface, no columnar wire format. Every batch goes through HTTP JSON parsing, SQL planning, and Lucene segment writes.
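That bulk path can be sketched in a few lines. The snippet below builds a CrateDB /_sql request body with `stmt` and `bulk_args` (these are CrateDB's actual HTTP API fields); the `metrics` table and row shape are illustrative, and the HTTP POST itself is left commented out so the sketch runs standalone.

```python
import json
# import urllib.request  # uncomment to actually send the request

def build_bulk_payload(rows):
    """Build a CrateDB /_sql bulk insert body.

    Each batch becomes one parameterized SQL statement plus a list of
    parameter tuples (bulk_args) — all serialized as HTTP JSON.
    The `metrics` table here is a hypothetical IoT schema.
    """
    stmt = "INSERT INTO metrics (time, host, value) VALUES (?, ?, ?)"
    return {
        "stmt": stmt,
        "bulk_args": [[r["time"], r["host"], r["value"]] for r in rows],
    }

rows = [
    {"time": 1700000000000, "host": "web-1", "value": 0.93},
    {"time": 1700000000000, "host": "web-2", "value": 0.87},
]
payload = build_bulk_payload(rows)
body = json.dumps(payload).encode("utf-8")  # every batch pays this JSON serialization cost

# req = urllib.request.Request(
#     "http://localhost:4200/_sql",
#     data=body,
#     headers={"Content-Type": "application/json"},
# )
# urllib.request.urlopen(req)

print(len(payload["bulk_args"]))
```

Every one of those batches is then parsed, planned as SQL, and written as Lucene segments on the server side.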

That's the architecture. The benchmark numbers follow from it.

A Note on Methodology

CrateDB is not on the official ClickBench leaderboard. That means we can't do an apples-to-apples comparison on the standard 43 ClickBench queries against the 99.9M row web analytics dataset.

Instead, we ran our own query suite on IoT/server metrics data — the same shape of data these systems are actually deployed for. The dataset has 5 columns: time, host, value, cpu_idle, cpu_user. CrateDB was loaded with 63 million rows. Arc had 111 million rows — about 75% more data.

This is worth disclosing upfront: Arc is querying a larger dataset. Despite that, Arc wins across the board.

Ingestion: The Protocol Gap

We ran sustained ingestion benchmarks on the same machine (Apple Silicon, 36GB RAM), loading IoT/server metrics data.

CrateDB's only viable bulk path is /_sql with bulk_args. We tested multiple worker/batch-size configurations to find the peak:

Config                    Throughput    p50        p99
20 workers / 5K rows      271K rec/s    273ms      2,486ms
20 workers / 15K rows     298K rec/s    587ms      7,725ms
50 workers / 15K rows     350K rec/s    1,904ms    7,621ms
100 workers / 20K rows    328K rec/s    3,568ms    17,263ms

Peak: 50 workers / 15,000 rows → ~350K rec/s.

Beyond that, CrateDB saturates. At 100 workers with 20K batches, throughput actually drops — the benchmark took 39 seconds instead of 30. CrateDB was backpressuring: clients were queuing requests faster than Lucene could flush segments. The p99 of 17 seconds tells that story.
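The sweep above follows a standard pattern: N workers hammering the bulk endpoint, with throughput and latency percentiles computed from per-batch timings. This is a minimal stdlib sketch of such a harness, not our actual benchmark code; `send_batch` stands in for whatever the target's bulk path is (for CrateDB, the /_sql POST), stubbed here with a sleep so the sketch runs without a database.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

def run_ingest_bench(send_batch, total_rows, workers, batch_size):
    """Drive `send_batch` from N concurrent workers; report throughput and latency."""
    batches = total_rows // batch_size

    def one_batch(_):
        t0 = time.perf_counter()
        send_batch(batch_size)             # one bulk request
        return time.perf_counter() - t0

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        latencies = sorted(pool.map(one_batch, range(batches)))
    elapsed = time.perf_counter() - start

    return {
        "rows_per_sec": (batches * batch_size) / elapsed,
        "p50_ms": statistics.median(latencies) * 1000,
        "p99_ms": latencies[int(len(latencies) * 0.99) - 1] * 1000,
    }

# Stub sender so the sketch runs standalone; replace with a real HTTP POST.
result = run_ingest_bench(lambda n: time.sleep(0.001),
                          total_rows=100_000, workers=20, batch_size=5_000)
print(result["rows_per_sec"])
```

The saturation signature shows up exactly as in the table: past the sweet spot, adding workers inflates p99 while throughput flattens or drops.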

Arc ingested the same data shape at 17.5M rec/s using MessagePack columnar over HTTP.

              Arc            CrateDB
Throughput    17.5M rec/s    350K rec/s
Ratio         ~50x faster

The reason isn't tuning — it's protocol and architecture. Arc's MessagePack columnar path sorts by time at ingest, then writes directly to Parquet via Apache Arrow with no query engine in the hot path. CrateDB routes every batch through HTTP JSON parsing, SQL planning, and Lucene segment management. There's a hard ceiling on that stack regardless of how many workers you throw at it.
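The protocol difference is easy to see on the wire. Arc's actual format is MessagePack; the sketch below uses stdlib JSON purely for illustration, contrasting a row-oriented batch (field names repeated per record, as in CrateDB's /_sql shape) with a columnar batch (one key per column, values packed as arrays).

```python
import json

# A batch of hypothetical server metrics, row-oriented:
# every record repeats every field name.
batch = [
    {"time": 1700000000000 + i, "host": "web-1", "cpu_idle": 90.0, "cpu_user": 5.0}
    for i in range(1000)
]
row_payload = json.dumps(batch)

# The same batch pivoted to columnar: each field name appears once,
# values travel as flat arrays (the shape Arc's MessagePack path uses).
columnar = {key: [row[key] for row in batch] for key in batch[0]}
col_payload = json.dumps(columnar)

print(len(row_payload), len(col_payload))
```

The columnar payload is smaller, and more importantly it arrives pre-pivoted: the server can hand each column straight to an Arrow builder instead of transposing a million little dicts first.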

Query Performance

Arc had 111M rows (compacted). CrateDB had 63M rows — Arc was querying about 75% more data.

Query                      Arc (Arrow IPC)    Arc (JSON)    CrateDB
Count All                  3.07ms             2.28ms        11.56ms
Select * LIMIT 1K          11.84ms            13.14ms       22.76ms
Select * LIMIT 10K         9.48ms             13.80ms       25.18ms
Select * LIMIT 100K        26.29ms            40.19ms       149.85ms
Select * LIMIT 500K        81.74ms            213.90ms      659.27ms
Select * LIMIT 1M          155.26ms           379.26ms      1,283.04ms
Time Range (1h)            1.57ms             3.28ms        22.82ms
Time Range (24h)           6.02ms             10.84ms       18.39ms
Time Range (7d)            5.64ms             8.71ms        18.42ms
Time Bucket (1h, 24h)      213.50ms           212.36ms      3,064.17ms
Time Bucket (1h, 7d)       208.14ms           211.74ms      2,740.65ms
Date Trunc (day, 30d)      206.03ms           211.20ms      2,796.57ms
SUM / AVG / MIN / MAX      119.31ms           119.16ms      695.84ms
Multi-column AGG           99.21ms            98.72ms       2,940.25ms
GROUP BY host              155.09ms           154.38ms      6,211.63ms
GROUP BY host+hour         286.48ms           281.08ms      7,342.88ms
DISTINCT hosts             30.77ms            28.62ms       2,516.90ms
Percentile p95             1,336.65ms         1,345.55ms    44,487.99ms
Top 10 by AVG              80.04ms            79.35ms       2,818.76ms
HAVING filter              80.64ms            79.91ms       2,818.18ms

Arc wins across the board on 75% more data. The gap ranges from ~4x on simple counts to ~82x on DISTINCT queries.

CrateDB is fast on simple operations — COUNT(*), LIMIT scans, and time range filters all come in under 25ms. That's Lucene doing what it's built for: fast indexed lookups on bounded result sets.

The pattern breaks under aggregation load. GROUP BY host takes 6.2 seconds on CrateDB; Arc handles it in 155ms — against 75% more data. GROUP BY host+hour: 7.3 seconds vs 286ms. Percentile p95: 44 seconds vs 1.3 seconds. SUM/AVG/MIN/MAX: 696ms vs 119ms.

Why the Gap Widens Under Aggregation

Simple scans and time range queries are CrateDB's home court. Lucene's inverted index structure is designed for exactly this: retrieve matching documents fast, return them. When ClickBench-style queries stay within that pattern, CrateDB looks competitive.

Aggregations break the model. To compute GROUP BY host or percentile_cont(0.95) over millions of rows, CrateDB has to deserialize doc values from Lucene segments, pull them into the JVM heap, and aggregate in-memory. The data isn't stored in a columnar format optimized for sequential scanning — it's structured for document retrieval. Under aggregation load, that gap opens up.
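A toy model makes the access-pattern difference concrete. This is neither engine's actual code — just the same column stored two ways: reachable only through per-document records (the doc-value deserialization path), versus one flat array scanned sequentially.

```python
# Toy model: the same cpu_idle column under two storage layouts.

# Document-style: each value lives inside a per-document record and must
# be pulled out one lookup at a time, as when doc values are
# deserialized from Lucene segments into the JVM heap.
docs = {doc_id: {"host": "web-1", "cpu_idle": float(doc_id % 100)}
        for doc_id in range(100_000)}

# Columnar: the same column as one contiguous array.
cpu_idle_column = [float(doc_id % 100) for doc_id in range(100_000)]

doc_sum = sum(docs[d]["cpu_idle"] for d in docs)  # pointer-chasing per document
col_sum = sum(cpu_idle_column)                    # tight sequential scan;
                                                  # SIMD-vectorizable in a real engine

print(doc_sum == col_sum)
```

Both paths produce the same answer; the point is what the hardware sees. The columnar scan is cache-friendly and vectorizable, while the document path pays a lookup and a deserialization per row — multiply that by 63 million rows per aggregation.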

Arc uses DuckDB under the hood — vectorized execution engine, columnar Parquet storage, SIMD operations. Aggregating 111M rows of cpu_idle values is what that stack is built for.
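For reference, the aggregation shape the suite exercises looks like this. The sketch uses SQLite from the standard library purely so it runs self-contained — in Arc these statements execute in DuckDB over Parquet files, not SQLite — and the `metrics` schema mirrors the 5-column dataset described above.

```python
import sqlite3

# In-memory SQLite standing in for DuckDB-over-Parquet, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metrics (time INTEGER, host TEXT, cpu_idle REAL)")
conn.executemany(
    "INSERT INTO metrics VALUES (?, ?, ?)",
    [(i, f"web-{i % 3}", float(i % 100)) for i in range(9_000)],
)

# Equivalent of the benchmark's "GROUP BY host" aggregation.
rows = conn.execute(
    "SELECT host, AVG(cpu_idle), MIN(cpu_idle), MAX(cpu_idle) "
    "FROM metrics GROUP BY host ORDER BY host"
).fetchall()
for host, avg, lo, hi in rows:
    print(host, round(avg, 2), lo, hi)
```

Swapping the connection for DuckDB pointed at Parquet would run the identical SQL; the engine underneath (vectorized, columnar) is what the 40x gap comes from.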

Arrow IPC vs JSON

Looking at the two Arc formats side by side, Arrow IPC pulls ahead on large row scans where it skips JSON serialization entirely:

  • At 500K rows: Arrow is 2.6x faster than JSON (81.74ms vs 213.90ms)
  • At 1M rows: Arrow is 2.4x faster than JSON (155.26ms vs 379.26ms)

For aggregations and point queries, both formats are essentially identical — the result payload is tiny, so serialization overhead is negligible either way. At sub-10ms query times the difference is noise.

The practical rule: use Arrow IPC when pulling large result sets — exports, bulk reads, downstream pipelines. For aggregations and point queries, either format works fine.

The Honest Caveat

CrateDB has real strengths this benchmark doesn't capture.

It's distributed. CrateDB scales horizontally across nodes. Arc open source is single-node, but Arc Enterprise includes clustering and high availability. If you need horizontal scale today on the open-source tier, CrateDB has that out of the box.

It combines search and analytics. CrateDB can run full-text search queries alongside SQL aggregations on the same data. If your workload mixes keyword search, document retrieval, and analytics over the same dataset, CrateDB's Lucene foundation is an asset rather than a liability.

Operational familiarity. Teams that already run Elasticsearch or OpenSearch will recognize the data model, the deployment patterns, and the operational tooling. That's not nothing.

What this benchmark measures is what happens when you push CrateDB into a pure analytical workload — high-throughput ingestion and aggregation-heavy queries. If that's the dominant pattern, the architecture gap shows up.

Bottom Line

Metric                            Arc vs CrateDB
Ingestion throughput              ~50x faster (17.5M vs 350K rec/s)
COUNT / simple scans              ~4x faster
Large row scans (100K–1M rows)    6–8x faster (on 75% more data)
Time bucket aggregations          ~13–14x faster
GROUP BY aggregations             26–40x faster
Percentile p95                    ~33x faster
SUM / AVG / MIN / MAX             ~6x faster
DISTINCT hosts                    ~82x faster
Storage format                    Parquet (portable) vs Lucene segments (proprietary)
Horizontal scaling                CrateDB OSS; Arc Enterprise includes clustering

This is the fifth post in the Arc ClickBench series. Previous comparisons: ClickHouse, TimescaleDB, InfluxDB/DataFusion, Elasticsearch.

Next up: StarRocks.


Get started:

Questions or challenges? Find us on Discord or open an issue on https://github.com/Basekick-Labs/arc/issues.

Ready to handle billion-record workloads?

Deploy Arc in minutes. Own your data in Parquet. Use for analytics, observability, AI, IoT, or data warehousing.

Get Started ->