Apache Parquet

Apache Parquet is an open-source, column-oriented file format designed for efficient storage and fast analytical queries. It is the de facto standard for storing large analytical datasets on object storage such as Amazon S3.

Why Parquet is everywhere

Parquet stores data by column, which means an analytical query reads only the columns it needs and skips the rest. It also encodes and compresses each column separately. Because the values in a column share a type and often repeat, they compress well, so files are small and cheap to store. Both properties make Parquet ideal for the large scans that analytics depends on.
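The idea can be sketched in a few lines of standard-library Python. This is a toy illustration, not the Parquet format: the real format uses row groups, column chunks, pages, and encodings such as dictionary and run-length encoding, while this sketch uses JSON and zlib as stand-ins. The sample table and the helper names (write_columnar, read_column) are hypothetical.

```python
import json
import zlib

# Hypothetical sample table: 1,000 rows with three columns.
rows = [
    {"id": i, "country": ["US", "DE", "JP"][i % 3], "amount": (i * 7) % 100}
    for i in range(1000)
]


def write_columnar(rows):
    """Store each column as its own independently compressed chunk."""
    columns = {}
    for name in rows[0]:
        values = [row[name] for row in rows]
        # JSON + zlib stand in for Parquet's real encodings and codecs.
        columns[name] = zlib.compress(json.dumps(values).encode("utf-8"))
    return columns


def read_column(columns, name):
    """Decompress one column; the other chunks are never touched."""
    return json.loads(zlib.decompress(columns[name]).decode("utf-8"))


columnar = write_columnar(rows)

# A query like SUM(amount) only needs the "amount" chunk.
total_amount = sum(read_column(columnar, "amount"))

# Row-oriented baseline: the whole table in one compressed blob,
# so any query has to read and decompress every column.
row_blob = zlib.compress(json.dumps(rows).encode("utf-8"))

bytes_read_columnar = len(columnar["amount"])
bytes_read_row = len(row_blob)

print("sum(amount):", total_amount)
print("bytes read, columnar layout:", bytes_read_columnar)
print("bytes read, row layout:", bytes_read_row)
```

In this sketch the aggregate touches only the compressed "amount" chunk, while the row-oriented baseline has to read the entire blob. The exact byte counts depend on the data and the codec, so treat them as illustrative rather than as a benchmark.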

The bigger reason Parquet matters is that it is open. Almost every modern data tool can read it: DuckDB, Spark, pandas, Polars, Snowflake, BigQuery, and dozens more. Your data is not trapped inside one vendor's engine. If you store your data as Parquet, you can pick it up and move it anywhere.

That portability is why Parquet has become the foundation of the modern data lake and lakehouse. It separates your data from the engine that queries it.

How Arc handles Apache Parquet

Arc stores all of your data as Parquet on storage you own, whether that is local disk, S3, MinIO, GCS, or Ceph. There is no proprietary format in the middle. If you ever stop using Arc, you still have every byte in open Parquet that any tool can read.

Arc is a high-performance columnar database: open Parquet on storage you own, a single Go binary, production-ready in 30 seconds.