Parquet vs Proprietary Formats: Why Standard Storage Matters


During my time at InfluxData, I watched a pattern repeat itself over and over.

A customer would come to us with terabytes of time-series data. Manufacturing telemetry. IoT sensor readings. Infrastructure metrics spanning years. They'd invested heavily in collecting this data, storing it, and building dashboards in Grafana.

Then someone from the data science team would ask: "Can we load this into pandas for analysis?"

And the answer was... complicated.

The data existed. It was valuable. But it was locked in a proprietary format that only one tool could efficiently read. Want to run your own analytics? Export to CSV first. Want to load it into Spark? Write a custom integration. Want to query it with standard SQL from a BI tool? Good luck.

This wasn't a bug. It was the architecture.

The Hidden Cost of Proprietary Formats

When you choose a time-series database, you're not just choosing a query engine. You're choosing a storage format. And that choice has consequences that don't show up until years later.

Proprietary formats create invisible walls around your data:

The Grafana-only trap. Visualization is solved—Grafana is excellent at what it does. But what about your data science team? What about that ML pipeline you want to build? What about the analyst who just wants to load five years of sensor data into a Jupyter notebook?

The migration tax. Want to switch databases? First you need to export everything. For petabyte-scale time-series data, that's not a weekend project. It's a multi-month migration with significant engineering investment. Most teams just... don't. They stay locked in.

The integration burden. Every time you want to connect a new tool, you're building a custom bridge. Custom exporters. Custom connectors. Custom ETL jobs. Each one is technical debt that someone has to maintain.

The "free" database that isn't. Many time-series databases are open source. The software costs nothing. But you pay with your flexibility. You pay with engineering hours building integrations. You pay with migration costs when you eventually need to move. The proprietary format is the lock-in mechanism.

I kept thinking: there has to be a better way.

What Parquet Actually Is

Apache Parquet is a columnar storage format that's become the industry standard for analytical workloads. It's what Databricks builds Delta Lake on, and what Snowflake, BigQuery, and most modern data platforms can read and query directly.

Here's why columnar matters for time-series data.

Traditional row-based storage writes data like this:

[timestamp1, device1, temp1, pressure1]
[timestamp2, device2, temp2, pressure2]
[timestamp3, device3, temp3, pressure3]

When you query "give me all temperatures from device1 for the last hour," the database has to read every column of every row to find the temperature values.

Columnar storage writes data like this:

timestamps: [timestamp1, timestamp2, timestamp3, ...]
devices:    [device1, device2, device3, ...]
temps:      [temp1, temp2, temp3, ...]
pressures:  [pressure1, pressure2, pressure3, ...]

Now a temperature query only reads the timestamp, device, and temperature columns. The pressure column is never touched. For wide tables with dozens of columns, this means reading 10-20% of the data instead of 100%.

But columnar storage does something else that matters: it enables much better compression.

When you store similar values together (all timestamps, all temperatures, all device IDs), compression algorithms find patterns more efficiently. Proprietary time-series databases typically achieve around 2x compression. Parquet achieves 3-5x on the same data. That's real money at scale: a smaller footprint sitting on S3 at roughly $23/TB/month instead of the ~$100/TB/month you'd pay for the provisioned block storage most databases require.
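
You can see both effects with a few lines of Python. This is a minimal sketch on synthetic data; the column names, row count, and the zstd codec are illustrative choices, not Arc's actual schema or defaults, and the exact ratio you get depends entirely on your data.

import os
import numpy as np
import pandas as pd

# Synthetic sensor data: repetitive device IDs and slowly varying readings,
# exactly the kind of patterns columnar compression exploits
n = 1_000_000
df = pd.DataFrame({
    "time": pd.date_range("2025-12-01", periods=n, freq="s"),
    "device_id": np.random.choice([f"device-{i}" for i in range(100)], n),
    "temperature": np.random.normal(70, 5, n).round(2),
    "pressure": np.random.normal(101.3, 0.5, n).round(3),
})

# Same data, two formats
df.to_csv("sensors.csv", index=False)
df.to_parquet("sensors.parquet", compression="zstd")

# CSV-to-Parquet size ratio; real sensor data usually compresses even better
print(os.path.getsize("sensors.csv") / os.path.getsize("sensors.parquet"))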

How Arc Makes Parquet Fast for Time-Series

Parquet is great for analytics. But time-series workloads have specific patterns: data arrives ordered by time, queries almost always filter by time ranges, and you're constantly appending new data while querying historical data.

Arc adds three things on top of Parquet that make it work for time-series at scale.

Time-Based Partition Pruning

Arc automatically partitions data by time—year, month, day, or hour depending on your configuration. Each partition is a separate Parquet file.

When you query the last 24 hours of data, Arc doesn't scan your entire dataset. It identifies which partition files contain data from that time range and only reads those files. Query the last hour? You might read one file instead of thousands.

SELECT device_id, AVG(temperature)
FROM sensors
WHERE time > NOW() - INTERVAL '1 hour'
GROUP BY device_id

If you have a year of data partitioned by day, this query reads 1 file instead of 365. That's not a database optimization—that's avoiding 99.7% of the I/O entirely.
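
Here's a sketch of the same idea using plain Parquet files and DuckDB's hive partitioning. The paths and layout are hypothetical (Arc manages its own partition scheme); the point is that a filter on the partition column skips whole directories before a single file is opened.

import duckdb
import pandas as pd

# Lay the synthetic data from the earlier sketch out as one directory per day:
#   sensors_by_day/date=2025-12-01/..., sensors_by_day/date=2025-12-02/..., etc.
df = pd.read_parquet("sensors.parquet")
df["date"] = df["time"].dt.date.astype(str)
df.to_parquet("sensors_by_day", partition_cols=["date"])

# DuckDB reads the partition value out of the path, so this query only opens
# the one directory for the requested day
duckdb.sql("""
    SELECT device_id, AVG(temperature) AS avg_temp
    FROM read_parquet('sensors_by_day/*/*.parquet', hive_partitioning = true)
    WHERE date = '2025-12-01'
    GROUP BY device_id
""").show()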

Pre-Computed Statistics

Every Parquet file stores metadata about its contents: minimum value, maximum value, count, and null count for each column. Arc uses these statistics to skip files before reading any actual data.

Query for temperatures above 100°C? If a file's maximum temperature is 85°C, Arc skips it entirely. No I/O, no decompression, no processing. The query planner uses statistics to eliminate files that can't possibly contain matching rows.

For filtered queries, this often eliminates 90%+ of files before any data is read.
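
You can inspect these statistics yourself with pyarrow; reading the footer is cheap and never touches the data pages. A small sketch against the hypothetical sensors file from earlier:

import pyarrow.parquet as pq

# Parquet footers record min, max, and null count per column for every row group
meta = pq.ParquetFile("sensors.parquet").metadata
col = meta.schema.names.index("temperature")

for rg in range(meta.num_row_groups):
    stats = meta.row_group(rg).column(col).statistics
    print(rg, stats.min, stats.max, stats.null_count)

# A planner filtering on temperature > 100 can drop any row group (or whole
# file) whose max is below the threshold without reading a single data page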

Columnar Projection

When you SELECT device_id, temperature FROM sensors, Arc tells DuckDB to read only those two columns from the Parquet files. If your table has 50 columns, you're reading 4% of the data. This is native to Parquet—Arc just makes sure queries take advantage of it.

The combination of partition pruning, statistics-based filtering, and columnar projection means most queries touch a tiny fraction of your total data. That's how you get sub-second responses on billion-row datasets.
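
In pandas or pyarrow the same trick is a single argument. A short sketch, again against the hypothetical sensors file:

import pandas as pd
import pyarrow.parquet as pq

# Only the requested column chunks are fetched and decompressed; every other
# column in the file is never read
df = pd.read_parquet("sensors.parquet", columns=["device_id", "temperature"])

# The same projection with pyarrow directly
table = pq.read_table("sensors.parquet", columns=["device_id", "temperature"])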

What "Portable Data" Actually Means

Here's the thing that changes everything: your data isn't locked in Arc.

Arc stores data in standard Parquet files. On S3, MinIO, Azure Blob Storage, or local disk. These files are readable by any tool that understands Parquet—which is basically every modern data tool.

Arc disappears tomorrow? You still own your data.

Let me show you what this looks like in practice.

Query Arc's Data with DuckDB (No Arc Running)

# Arc stores data in partitioned Parquet files
$ ls /data/arc/sensors/
2025-12-01.parquet
2025-12-02.parquet
2025-12-03.parquet
 
# Query directly with DuckDB CLI - no Arc server needed
$ duckdb -c "
SELECT device_id, AVG(temperature) as avg_temp
FROM read_parquet('/data/arc/sensors/*.parquet')
WHERE time > '2025-12-01'
GROUP BY device_id
ORDER BY avg_temp DESC
LIMIT 10
"

That's it. Standard DuckDB reading standard Parquet files. No special drivers. No export process. No Arc required.

Load Into Pandas for Data Science

import pandas as pd
 
# Load Arc's Parquet files directly
df = pd.read_parquet('/data/arc/sensors/')
 
# Now you have a DataFrame - do whatever you want
df_hourly = df.resample('1H', on='time').agg({
    'temperature': ['mean', 'min', 'max'],
    'pressure': 'mean'
})
 
# Train your ML model, build your reports, share with analysts

Your data scientists don't need Arc credentials. They don't need to learn a proprietary query language. They just read Parquet files with tools they already know.

Query from Spark, Snowflake, or Any Parquet-Compatible Tool

# Spark
df = spark.read.parquet("s3://your-bucket/arc/sensors/")
 
# Snowflake (external stage pointing to your S3 bucket)
SELECT $1:device_id, $1:temperature FROM @arc_stage/sensors/ (FILE_FORMAT => 'parquet_format');
 
# Polars
import polars as pl
df = pl.read_parquet("/data/arc/sensors/*.parquet")

This is what "portable data" actually means. Not "we have an export function." Not "we support CSV dumps." Your data is already in a format that every modern tool can read natively.

The Tradeoffs

I'd be lying if I said there were no tradeoffs. Parquet is optimized for analytical reads, not real-time writes.

Write amplification. Parquet files are immutable—you don't update them in place. Arc buffers incoming data and periodically flushes to new Parquet files, then compacts small files into larger ones. This works great for time-series (where data is append-only anyway), but it means there's a short delay before data is queryable.

Not ideal for point lookups. If your workload is "fetch one specific row by ID," columnar storage adds overhead compared to row-based formats. Time-series queries are almost never point lookups—they're aggregations over time ranges—so this rarely matters in practice.

File count management. High-cardinality ingestion can create many small files. Arc handles this with automatic compaction, merging small files into optimized larger ones (a sketch of the idea follows below). But it's something the system has to manage.
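
Compaction itself isn't exotic. Here's a minimal sketch of the idea using DuckDB to fold one day's small files into a single larger, zstd-compressed file; the paths are hypothetical, and Arc's own compactor does this for you automatically.

import duckdb

# Rewrite a day's worth of small files as one larger file, sorted by time.
# Once readers switch to the compacted file, the small files can be deleted.
duckdb.sql("""
    COPY (
        SELECT *
        FROM read_parquet('sensors_by_day/date=2025-12-01/*.parquet')
        ORDER BY time
    )
    TO 'sensors_2025-12-01_compacted.parquet' (FORMAT parquet, COMPRESSION zstd)
""")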

For time-series and analytical workloads, these tradeoffs are worth it. The flexibility of portable data, the compression savings, and the ecosystem compatibility outweigh the costs.

For OLTP workloads with lots of point lookups and updates? Use PostgreSQL. Different tool for a different job.

The Industry Is Moving This Direction

This isn't just Arc's opinion. The entire data industry is converging on open formats.

Databricks built Delta Lake on Parquet. Snowflake, BigQuery, and Redshift all support direct Parquet queries. The "data lakehouse" architecture—which is eating the traditional data warehouse market—is built on the premise that storage should be open and portable.

The proprietary format era is ending. The question isn't whether to adopt open formats—it's when.

We built Arc on Parquet because we believe your data should outlive your database vendor. You should be able to query your data with any tool, move it to any platform, and never worry about export processes or format conversions.

Zero proprietary formats. Zero lock-in.

This is what it looks like when you actually own your data.



Ready to handle billion-record workloads?

Deploy Arc in minutes. Own your data in Parquet.

Get Started ->