Arc on ClickBench: CrateDB. 50x on Ingestion, Not Close on Analytics.

CrateDB has an interesting position in the market. It's a distributed SQL database built on top of Lucene — think Elasticsearch with a proper SQL layer on top. It targets IoT, observability, and machine data at scale. Teams that want SQL semantics without giving up the horizontal scaling story of the Elastic stack often land on CrateDB.
So we ran Arc against it.
What CrateDB Is
CrateDB is built on Lucene, the same index engine that powers Elasticsearch. It adds a SQL layer, a planner, and distributed query execution on top. The storage model is still inverted indexes and doc values — not columnar Parquet files, not vectorized execution.
It supports standard SQL including GROUP BY, aggregations, and time-based queries. The ingestion path is HTTP JSON via /_sql with bulk inserts — there's no binary protocol, no COPY-style interface, no columnar wire format. Every batch goes through HTTP JSON parsing, SQL planning, and Lucene segment writes.
That's the architecture. The benchmark numbers follow from it.
A Note on Methodology
CrateDB is not on the official ClickBench leaderboard. That means we can't do an apples-to-apples comparison on the standard 43 ClickBench queries against the 99.9M row web analytics dataset.
Instead, we ran our own query suite on IoT/server metrics data — the same shape of data these systems are actually deployed for. The dataset has 5 columns: time, host, value, cpu_idle, cpu_user. CrateDB was loaded with 63 million rows. Arc had 111 million rows — about 75% more data.
This is worth disclosing upfront: Arc is querying a larger dataset. Despite that, Arc wins across the board.
Ingestion: The Protocol Gap
We ran sustained ingestion benchmarks on the same machine (Apple Silicon, 36GB RAM), loading IoT/server metrics data.
CrateDB's only viable bulk path is /_sql with bulk_args. We tested multiple worker/batch-size configurations to find the peak:
| Config | Throughput | p50 | p99 |
|---|---|---|---|
| 20 workers / 5K rows | 271K rec/s | 273ms | 2,486ms |
| 20 workers / 15K rows | 298K rec/s | 587ms | 7,725ms |
| 50 workers / 15K rows | 350K rec/s | 1,904ms | 7,621ms |
| 100 workers / 20K rows | 328K rec/s | 3,568ms | 17,263ms |
Peak: 50 workers / 15,000 rows → ~350K rec/s.
Beyond that, CrateDB saturates. At 100 workers with 20K batches, throughput actually drops — the benchmark took 39 seconds instead of 30. CrateDB was backpressuring: clients were queuing requests faster than Lucene could flush segments. The p99 of 17 seconds tells that story.
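For reference, the bulk path the benchmark exercised can be sketched in a few lines. CrateDB's HTTP endpoint takes one parameterized statement plus a `bulk_args` array, one entry per row. The table name `metrics` and the five column names here are assumptions matching the dataset description, not the benchmark's actual schema:

```python
import json
# from urllib.request import Request, urlopen  # for the actual POST

CRATE_URL = "http://localhost:4200/_sql"  # CrateDB's default HTTP port
BATCH_SIZE = 15_000  # peak config from the table above: 50 workers x 15K rows

def bulk_payload(rows):
    """Build a CrateDB /_sql bulk-insert body: one parameterized INSERT
    plus a bulk_args list with one parameter array per row."""
    return {
        "stmt": ("INSERT INTO metrics (time, host, value, cpu_idle, cpu_user) "
                 "VALUES (?, ?, ?, ?, ?)"),
        "bulk_args": [[r["time"], r["host"], r["value"],
                       r["cpu_idle"], r["cpu_user"]] for r in rows],
    }

# Tiny illustrative batch; a real run would send BATCH_SIZE rows per request.
rows = [{"time": 1700000000000 + i, "host": f"host-{i % 10}",
         "value": 0.5, "cpu_idle": 90.0, "cpu_user": 5.0}
        for i in range(3)]
body = json.dumps(bulk_payload(rows))
# urlopen(Request(CRATE_URL, data=body.encode(),
#                 headers={"Content-Type": "application/json"}))
```

Every one of those request bodies is parsed as JSON, planned as SQL, and written through Lucene segments, which is where the ceiling comes from.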
Arc ingested the same data shape at 17.5M rec/s using MessagePack columnar over HTTP.
| | Arc | CrateDB |
|---|---|---|
| Throughput | 17.5M rec/s | 350K rec/s |
| Ratio | ~50x faster | — |
The reason isn't tuning — it's protocol and architecture. Arc's MessagePack columnar path sorts by time at ingest, then writes directly to Parquet via Apache Arrow with no query engine in the hot path. CrateDB routes every batch through HTTP JSON parsing, SQL planning, and Lucene segment management. There's a hard ceiling on that stack regardless of how many workers you throw at it.
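Arc's exact wire schema isn't reproduced here, but the core idea of a columnar batch format can be sketched: pivot row-oriented records into parallel column arrays before serialization, so field names are written once per batch instead of once per row and each column arrives as a contiguous array. `to_columnar` is an illustrative helper, not Arc's API:

```python
def to_columnar(rows):
    """Pivot row-oriented records into parallel column arrays --
    the shape a columnar wire format serializes. Assumes every
    row has the same keys, as in a fixed-schema metrics batch."""
    if not rows:
        return {}
    return {key: [r[key] for r in rows] for key in rows[0]}

rows = [
    {"time": 1, "host": "a", "cpu_idle": 90.0},
    {"time": 2, "host": "b", "cpu_idle": 85.0},
]
columns = to_columnar(rows)
# A call like msgpack.packb(columns) would then serialize the whole batch
# in one shot (msgpack is a third-party package, shown only as a comment).
```

The receiving side can hand those arrays straight to a columnar writer, which is why this path avoids per-row parsing entirely.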
Query Performance
Arc had 111M rows (compacted). CrateDB had 63M rows — Arc was querying about 75% more data.
| Query | Arc (Arrow IPC) | Arc (JSON) | CrateDB |
|---|---|---|---|
| Count All | 3.07ms | 2.28ms | 11.56ms |
| Select * LIMIT 1K | 11.84ms | 13.14ms | 22.76ms |
| Select * LIMIT 10K | 9.48ms | 13.80ms | 25.18ms |
| Select * LIMIT 100K | 26.29ms | 40.19ms | 149.85ms |
| Select * LIMIT 500K | 81.74ms | 213.90ms | 659.27ms |
| Select * LIMIT 1M | 155.26ms | 379.26ms | 1,283.04ms |
| Time Range (1h) | 1.57ms | 3.28ms | 22.82ms |
| Time Range (24h) | 6.02ms | 10.84ms | 18.39ms |
| Time Range (7d) | 5.64ms | 8.71ms | 18.42ms |
| Time Bucket (1h, 24h) | 213.50ms | 212.36ms | 3,064.17ms |
| Time Bucket (1h, 7d) | 208.14ms | 211.74ms | 2,740.65ms |
| Date Trunc (day, 30d) | 206.03ms | 211.20ms | 2,796.57ms |
| SUM / AVG / MIN / MAX | 119.31ms | 119.16ms | 695.84ms |
| Multi-column AGG | 99.21ms | 98.72ms | 2,940.25ms |
| GROUP BY host | 155.09ms | 154.38ms | 6,211.63ms |
| GROUP BY host+hour | 286.48ms | 281.08ms | 7,342.88ms |
| DISTINCT hosts | 30.77ms | 28.62ms | 2,516.90ms |
| Percentile p95 | 1,336.65ms | 1,345.55ms | 44,487.99ms |
| Top 10 by AVG | 80.04ms | 79.35ms | 2,818.76ms |
| HAVING filter | 80.64ms | 79.91ms | 2,818.18ms |
Arc wins across the board on 75% more data. The gap ranges from ~4x on simple counts to ~82x on DISTINCT queries.
CrateDB is fast on simple operations: COUNT(*), small LIMIT scans, and time range filters all come in at roughly 25ms or less. That's Lucene doing what it's built for: fast indexed lookups on bounded result sets.
The pattern breaks under aggregation load. GROUP BY host takes 6.2 seconds on CrateDB; Arc handles it in 155ms, against 75% more data. GROUP BY host+hour: 7.3 seconds vs 286ms. Percentile p95: 44 seconds vs 1.3 seconds. SUM/AVG/MIN/MAX: 696ms vs 119ms.
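For context, the aggregation queries in the table have roughly these SQL shapes. The table name `metrics` and the exact SQL text are assumptions based on the dataset's five columns, not the benchmark's verbatim statements:

```python
# Representative shapes of the aggregation queries benchmarked above.
QUERIES = {
    "group_by_host": (
        "SELECT host, avg(cpu_idle) AS avg_idle, count(*) AS n "
        "FROM metrics GROUP BY host"
    ),
    "time_bucket_1h_24h": (
        "SELECT date_trunc('hour', time) AS bucket, avg(value) "
        "FROM metrics WHERE time > now() - INTERVAL '24 hours' "
        "GROUP BY bucket ORDER BY bucket"
    ),
    "percentile_p95": (
        "SELECT percentile_cont(0.95) WITHIN GROUP (ORDER BY value) "
        "FROM metrics"
    ),
}
```

Each of these forces a full scan of at least one value column, which is exactly the access pattern the two storage models handle so differently.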
Why the Gap Widens Under Aggregation
Simple scans and time range queries are CrateDB's home court. Lucene's inverted index structure is designed for exactly this: retrieve matching documents fast, return them. When queries stay within that pattern, CrateDB looks competitive.
Aggregations break the model. To compute GROUP BY host or percentile_cont(0.95) over millions of rows, CrateDB has to deserialize doc values from Lucene segments, pull them into the JVM heap, and aggregate in-memory. The data isn't stored in a columnar format optimized for sequential scanning — it's structured for document retrieval. Under aggregation load, that gap opens up.
Arc uses DuckDB under the hood — vectorized execution engine, columnar Parquet storage, SIMD operations. Aggregating 111M rows of cpu_idle values is what that stack is built for.
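A toy illustration of the access-pattern difference, assuming nothing about either engine's internals: aggregating over parallel column arrays is one sequential pass with no per-document object materialization, which is the layout a vectorized engine turns into cache-friendly SIMD work:

```python
from collections import defaultdict

def group_avg_columnar(hosts, cpu_idle):
    """GROUP BY host / AVG(cpu_idle) over two parallel column arrays.
    One sequential pass over contiguous values; a real engine like
    DuckDB does the same scan in vectorized blocks rather than a
    Python loop."""
    sums, counts = defaultdict(float), defaultdict(int)
    for h, v in zip(hosts, cpu_idle):
        sums[h] += v
        counts[h] += 1
    return {h: sums[h] / counts[h] for h in sums}

hosts = ["a", "b", "a", "b"]
cpu_idle = [90.0, 80.0, 70.0, 60.0]
print(group_avg_columnar(hosts, cpu_idle))  # {'a': 80.0, 'b': 70.0}
```

A doc-value store has to reassemble those values per document before it can run the same loop; the columnar layout starts with them already adjacent.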
Arrow IPC vs JSON
Looking at the two Arc formats side by side, Arrow IPC pulls ahead on large row scans where it skips JSON serialization entirely:
- At 500K rows: Arrow is 2.6x faster than JSON (81.74ms vs 213.90ms)
- At 1M rows: Arrow is 2.4x faster than JSON (155.26ms vs 379.26ms)
For aggregations and point queries, both formats are essentially identical — the result payload is tiny, so serialization overhead is negligible either way. At sub-10ms query times the difference is noise.
The practical rule: use Arrow IPC when pulling large result sets — exports, bulk reads, downstream pipelines. For aggregations and point queries, either format works fine.
The Honest Caveat
CrateDB has real strengths this benchmark doesn't capture.
It's distributed. CrateDB scales horizontally across nodes. Arc open source is single-node, but Arc Enterprise includes clustering and high availability. If you need horizontal scale today on the open-source tier, CrateDB has that out of the box.
It combines search and analytics. CrateDB can run full-text search queries alongside SQL aggregations on the same data. If your workload mixes keyword search, document retrieval, and analytics over the same dataset, CrateDB's Lucene foundation is an asset rather than a liability.
Operational familiarity. Teams that already run Elasticsearch or OpenSearch will recognize the data model, the deployment patterns, and the operational tooling. That's not nothing.
What this benchmark measures is what happens when you push CrateDB into a pure analytical workload — high-throughput ingestion and aggregation-heavy queries. If that's the dominant pattern, the architecture gap shows up.
Bottom Line
| Metric | Arc vs CrateDB |
|---|---|
| Ingestion throughput | ~50x faster (17.5M vs 350K rec/s) |
| COUNT / simple scans | ~4x faster |
| Large row scans (100K–1M rows) | 6–8x faster (on 75% more data) |
| Time bucket aggregations | ~13–14x faster |
| GROUP BY aggregations | 26–40x faster |
| Percentile p95 | ~33x faster |
| SUM / AVG / MIN / MAX | ~6x faster |
| DISTINCT hosts | ~82x faster |
| Storage format | Parquet (portable) vs Lucene segments (proprietary) |
| Horizontal scaling | CrateDB OSS; Arc Enterprise includes clustering |
This is the fifth post in the Arc ClickBench series. Previous comparisons: ClickHouse, TimescaleDB, InfluxDB/DataFusion, Elasticsearch.
Next up: StarRocks.
Questions or challenges? Find us on Discord or open an issue at https://github.com/Basekick-Labs/arc/issues.
Ready to handle billion-record workloads? Get started: deploy Arc in minutes, own your data in Parquet, and use it for analytics, observability, AI, IoT, or data warehousing.