
Arc on ClickBench: CrateDB. 50x on Ingestion, Not Close on Analytics.


CrateDB has an interesting position in the market. It's a distributed SQL database built on top of Lucene — think Elasticsearch with a proper SQL layer on top. It targets IoT, observability, and machine data at scale. Teams that want SQL semantics without giving up the horizontal scaling story of the Elastic stack often land on CrateDB.

So we ran Arc against it.

What CrateDB Is

CrateDB is built on Lucene, the same index engine that powers Elasticsearch. It adds a SQL layer, a planner, and distributed query execution on top. The storage model is still inverted indexes and doc values — not columnar Parquet files, not vectorized execution.

It supports standard SQL including GROUP BY, aggregations, and time-based queries. The ingestion path is HTTP JSON via /_sql with bulk inserts — there's no binary protocol, no COPY-style interface, no columnar wire format. Every batch goes through HTTP JSON parsing, SQL planning, and Lucene segment writes.
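That bulk path can be sketched in a few lines. The snippet below builds a CrateDB /_sql request body with `stmt` and `bulk_args` (these are CrateDB's actual HTTP API fields); the `metrics` table and row shape are illustrative, and the HTTP POST itself is left commented out so the sketch runs standalone.

```python
import json
# import urllib.request  # uncomment to actually send the request

def build_bulk_payload(rows):
    """Build a CrateDB /_sql bulk insert body.

    Each batch becomes one parameterized SQL statement plus a list of
    parameter tuples (bulk_args) — all serialized as HTTP JSON.
    The `metrics` table here is a hypothetical IoT schema.
    """
    stmt = "INSERT INTO metrics (time, host, value) VALUES (?, ?, ?)"
    return {
        "stmt": stmt,
        "bulk_args": [[r["time"], r["host"], r["value"]] for r in rows],
    }

rows = [
    {"time": 1700000000000, "host": "web-1", "value": 0.93},
    {"time": 1700000000000, "host": "web-2", "value": 0.87},
]
payload = build_bulk_payload(rows)
body = json.dumps(payload).encode("utf-8")  # every batch pays this JSON serialization cost

# req = urllib.request.Request(
#     "http://localhost:4200/_sql",
#     data=body,
#     headers={"Content-Type": "application/json"},
# )
# urllib.request.urlopen(req)

print(len(payload["bulk_args"]))
```

Every one of those batches is then parsed, planned as SQL, and written as Lucene segments on the server side.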

That's the architecture. The benchmark numbers follow from it.

A Note on Methodology

CrateDB is not on the official ClickBench leaderboard. That means we can't do an apples-to-apples comparison on the standard 43 ClickBench queries against the 99.9M row web analytics dataset.

Instead, we ran our own query suite on IoT/server metrics data — the same shape of data these systems are actually deployed for. The dataset has 5 columns: time, host, value, cpu_idle, cpu_user. CrateDB was loaded with 63 million rows. Arc had 111 million rows — about 75% more data.

This is worth disclosing upfront: Arc is querying a larger dataset. Despite that, Arc wins across the board.

Ingestion: The Protocol Gap

We ran sustained ingestion benchmarks on the same machine (Apple Silicon, 36GB RAM), loading IoT/server metrics data.

CrateDB's only viable bulk path is /_sql with bulk_args. We tested multiple worker/batch-size configurations to find the peak:

Config                    Throughput    p50        p99
20 workers / 5K rows      271K rec/s    273ms      2,486ms
20 workers / 15K rows     298K rec/s    587ms      7,725ms
50 workers / 15K rows     350K rec/s    1,904ms    7,621ms
100 workers / 20K rows    328K rec/s    3,568ms    17,263ms

Peak: 50 workers / 15,000 rows → ~350K rec/s.

Beyond that, CrateDB saturates. At 100 workers with 20K batches, throughput actually drops — the benchmark took 39 seconds instead of 30. CrateDB was backpressuring: clients were queuing requests faster than Lucene could flush segments. The p99 of 17 seconds tells that story.
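The sweep above follows a standard pattern: N workers hammering the bulk endpoint, with throughput and latency percentiles computed from per-batch timings. This is a minimal stdlib sketch of such a harness, not our actual benchmark code; `send_batch` stands in for whatever the target's bulk path is (for CrateDB, the /_sql POST), stubbed here with a sleep so the sketch runs without a database.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

def run_ingest_bench(send_batch, total_rows, workers, batch_size):
    """Drive `send_batch` from N concurrent workers; report throughput and latency."""
    batches = total_rows // batch_size

    def one_batch(_):
        t0 = time.perf_counter()
        send_batch(batch_size)             # one bulk request
        return time.perf_counter() - t0

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        latencies = sorted(pool.map(one_batch, range(batches)))
    elapsed = time.perf_counter() - start

    return {
        "rows_per_sec": (batches * batch_size) / elapsed,
        "p50_ms": statistics.median(latencies) * 1000,
        "p99_ms": latencies[int(len(latencies) * 0.99) - 1] * 1000,
    }

# Stub sender so the sketch runs standalone; replace with a real HTTP POST.
result = run_ingest_bench(lambda n: time.sleep(0.001),
                          total_rows=100_000, workers=20, batch_size=5_000)
print(result["rows_per_sec"])
```

The saturation signature shows up exactly as in the table: past the sweet spot, adding workers inflates p99 while throughput flattens or drops.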

Arc ingested the same data shape at 17.5M rec/s using MessagePack columnar over HTTP.

              Arc            CrateDB
Throughput    17.5M rec/s    350K rec/s
Ratio         ~50x faster

The reason isn't tuning — it's protocol and architecture. Arc's MessagePack columnar path sorts by time at ingest, then writes directly to Parquet via Apache Arrow with no query engine in the hot path. CrateDB routes every batch through HTTP JSON parsing, SQL planning, and Lucene segment management. There's a hard ceiling on that stack regardless of how many workers you throw at it.
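The protocol difference is easy to see on the wire. Arc's actual format is MessagePack; the sketch below uses stdlib JSON purely for illustration, contrasting a row-oriented batch (field names repeated per record, as in CrateDB's /_sql shape) with a columnar batch (one key per column, values packed as arrays).

```python
import json

# A batch of hypothetical server metrics, row-oriented:
# every record repeats every field name.
batch = [
    {"time": 1700000000000 + i, "host": "web-1", "cpu_idle": 90.0, "cpu_user": 5.0}
    for i in range(1000)
]
row_payload = json.dumps(batch)

# The same batch pivoted to columnar: each field name appears once,
# values travel as flat arrays (the shape Arc's MessagePack path uses).
columnar = {key: [row[key] for row in batch] for key in batch[0]}
col_payload = json.dumps(columnar)

print(len(row_payload), len(col_payload))
```

The columnar payload is smaller, and more importantly it arrives pre-pivoted: the server can hand each column straight to an Arrow builder instead of transposing a million little dicts first.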

Query Performance

Arc had 111M rows (compacted). CrateDB had 63M rows — Arc was querying about 75% more data.

Query                      Arc (Arrow IPC)    Arc (JSON)    CrateDB
Count All                  3.07ms             2.28ms        11.56ms
Select * LIMIT 1K          11.84ms            13.14ms       22.76ms
Select * LIMIT 10K         9.48ms             13.80ms       25.18ms
Select * LIMIT 100K        26.29ms            40.19ms       149.85ms
Select * LIMIT 500K        81.74ms            213.90ms      659.27ms
Select * LIMIT 1M          155.26ms           379.26ms      1,283.04ms
Time Range (1h)            1.57ms             3.28ms        22.82ms
Time Range (24h)           6.02ms             10.84ms       18.39ms
Time Range (7d)            5.64ms             8.71ms        18.42ms
Time Bucket (1h, 24h)      213.50ms           212.36ms      3,064.17ms
Time Bucket (1h, 7d)       208.14ms           211.74ms      2,740.65ms
Date Trunc (day, 30d)      206.03ms           211.20ms      2,796.57ms
SUM / AVG / MIN / MAX      119.31ms           119.16ms      695.84ms
Multi-column AGG           99.21ms            98.72ms       2,940.25ms
GROUP BY host              155.09ms           154.38ms      6,211.63ms
GROUP BY host+hour         286.48ms           281.08ms      7,342.88ms
DISTINCT hosts             30.77ms            28.62ms       2,516.90ms
Percentile p95             1,336.65ms         1,345.55ms    44,487.99ms
Top 10 by AVG              80.04ms            79.35ms       2,818.76ms
HAVING filter              80.64ms            79.91ms       2,818.18ms

Arc wins across the board on 75% more data. The gap ranges from ~4x on simple counts to ~82x on DISTINCT queries.

CrateDB is fast on simple operations — COUNT(*), LIMIT scans, and time range filters all come in under 25ms. That's Lucene doing what it's built for: fast indexed lookups on bounded result sets.

The pattern breaks under aggregation load. GROUP BY host takes 6.2 seconds on CrateDB; Arc handles it in 155ms — against 75% more data. GROUP BY host+hour: 7.3 seconds vs 286ms. Percentile p95: 44 seconds vs 1.3 seconds. SUM/AVG/MIN/MAX: 696ms vs 119ms.

Why the Gap Widens Under Aggregation

Simple scans and time range queries are CrateDB's home court. Lucene's inverted index structure is designed for exactly this: retrieve matching documents fast, return them. When ClickBench-style queries stay within that pattern, CrateDB looks competitive.

Aggregations break the model. To compute GROUP BY host or percentile_cont(0.95) over millions of rows, CrateDB has to deserialize doc values from Lucene segments, pull them into the JVM heap, and aggregate in-memory. The data isn't stored in a columnar format optimized for sequential scanning — it's structured for document retrieval. Under aggregation load, that gap opens up.
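A toy model makes the access-pattern difference concrete. This is neither engine's actual code — just the same column stored two ways: reachable only through per-document records (the doc-value deserialization path), versus one flat array scanned sequentially.

```python
# Toy model: the same cpu_idle column under two storage layouts.

# Document-style: each value lives inside a per-document record and must
# be pulled out one lookup at a time, as when doc values are
# deserialized from Lucene segments into the JVM heap.
docs = {doc_id: {"host": "web-1", "cpu_idle": float(doc_id % 100)}
        for doc_id in range(100_000)}

# Columnar: the same column as one contiguous array.
cpu_idle_column = [float(doc_id % 100) for doc_id in range(100_000)]

doc_sum = sum(docs[d]["cpu_idle"] for d in docs)  # pointer-chasing per document
col_sum = sum(cpu_idle_column)                    # tight sequential scan;
                                                  # SIMD-vectorizable in a real engine

print(doc_sum == col_sum)
```

Both paths produce the same answer; the point is what the hardware sees. The columnar scan is cache-friendly and vectorizable, while the document path pays a lookup and a deserialization per row — multiply that by 63 million rows per aggregation.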

Arc uses DuckDB under the hood — vectorized execution engine, columnar Parquet storage, SIMD operations. Aggregating 111M rows of cpu_idle values is what that stack is built for.
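For reference, the aggregation shape the suite exercises looks like this. The sketch uses SQLite from the standard library purely so it runs self-contained — in Arc these statements execute in DuckDB over Parquet files, not SQLite — and the `metrics` schema mirrors the 5-column dataset described above.

```python
import sqlite3

# In-memory SQLite standing in for DuckDB-over-Parquet, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metrics (time INTEGER, host TEXT, cpu_idle REAL)")
conn.executemany(
    "INSERT INTO metrics VALUES (?, ?, ?)",
    [(i, f"web-{i % 3}", float(i % 100)) for i in range(9_000)],
)

# Equivalent of the benchmark's "GROUP BY host" aggregation.
rows = conn.execute(
    "SELECT host, AVG(cpu_idle), MIN(cpu_idle), MAX(cpu_idle) "
    "FROM metrics GROUP BY host ORDER BY host"
).fetchall()
for host, avg, lo, hi in rows:
    print(host, round(avg, 2), lo, hi)
```

Swapping the connection for DuckDB pointed at Parquet would run the identical SQL; the engine underneath (vectorized, columnar) is what the 40x gap comes from.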

Arrow IPC vs JSON

Looking at the two Arc formats side by side, Arrow IPC pulls ahead on large row scans where it skips JSON serialization entirely:

  • At 500K rows: Arrow is 2.6x faster than JSON (81.74ms vs 213.90ms)
  • At 1M rows: Arrow is 2.4x faster than JSON (155.26ms vs 379.26ms)

For aggregations and point queries, both formats are essentially identical — the result payload is tiny, so serialization overhead is negligible either way. At sub-10ms query times the difference is noise.

The practical rule: use Arrow IPC when pulling large result sets — exports, bulk reads, downstream pipelines. For aggregations and point queries, either format works fine.

The Honest Caveat

CrateDB has real strengths this benchmark doesn't capture.

It's distributed. CrateDB scales horizontally across nodes. Arc open source is single-node, but Arc Enterprise includes clustering and high availability. If you need horizontal scale today on the open-source tier, CrateDB has that out of the box.

It combines search and analytics. CrateDB can run full-text search queries alongside SQL aggregations on the same data. If your workload mixes keyword search, document retrieval, and analytics over the same dataset, CrateDB's Lucene foundation is an asset rather than a liability.

Operational familiarity. Teams that already run Elasticsearch or OpenSearch will recognize the data model, the deployment patterns, and the operational tooling. That's not nothing.

What this benchmark measures is what happens when you push CrateDB into a pure analytical workload — high-throughput ingestion and aggregation-heavy queries. If that's the dominant pattern, the architecture gap shows up.

Bottom Line

Metric                            Arc vs CrateDB
Ingestion throughput              ~50x faster (17.5M vs 350K rec/s)
COUNT / simple scans              ~4x faster
Large row scans (100K–1M rows)    6–8x faster (on 75% more data)
Time bucket aggregations          ~13–14x faster
GROUP BY aggregations             26–40x faster
Percentile p95                    ~33x faster
SUM / AVG / MIN / MAX             ~6x faster
DISTINCT hosts                    ~82x faster
Storage format                    Parquet (portable) vs Lucene segments (proprietary)
Horizontal scaling                CrateDB OSS; Arc Enterprise includes clustering

This is the fifth post in the Arc ClickBench series. Previous comparisons: ClickHouse, TimescaleDB, InfluxDB/DataFusion, Elasticsearch.

Next up: StarRocks.


Get started:

Questions or challenges? Find us on Discord or open an issue on https://github.com/Basekick-Labs/arc/issues.

Ready to handle billion-record workloads?

Deploy Arc in minutes. Own your data in Parquet. Use for analytics, observability, AI, IoT, or data warehousing.

Get Started ->