Compaction
Compaction is the process of merging many small data files into fewer, larger ones. It reduces storage overhead, improves compression, and makes queries faster by cutting the number of files an engine has to open.
Why small files hurt and compaction helps
High-volume ingestion creates a lot of small files. Every flush of incoming data writes a new file, and over hours and days those pile up into thousands or millions of tiny files. That is a problem for two reasons: a query has to open every file it scans and read that file's metadata, and small files compress poorly because each one is compressed on its own, with too little data for the compressor to find repetition.
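The compression penalty is easy to see in a small experiment. The sketch below is illustrative only: it uses synthetic rows and Python's built-in zlib as a stand-in for a real columnar format and codec, and the helper names (make_rows, compressed_size_separately, compressed_size_merged) are made up for this example. It compresses the same data once as many tiny "files" and once as a single merged one.

```python
import zlib


def make_rows(n):
    # Synthetic, repetitive rows standing in for ingested records (assumption).
    return [f"sensor-{i % 10},temperature,{20 + i % 5}\n".encode() for i in range(n)]


def compressed_size_separately(rows, rows_per_file):
    # Compress each small chunk on its own, as if every flush wrote one file.
    total = 0
    for start in range(0, len(rows), rows_per_file):
        chunk = b"".join(rows[start:start + rows_per_file])
        total += len(zlib.compress(chunk))
    return total


def compressed_size_merged(rows):
    # Compress everything as one large "file", as if after compaction.
    return len(zlib.compress(b"".join(rows)))


rows = make_rows(10_000)
small = compressed_size_separately(rows, rows_per_file=10)
merged = compressed_size_merged(rows)
print(f"1000 small files: {small} bytes, 1 merged file: {merged} bytes")
```

The exact numbers depend on the data and the codec, but the merged total should come out well below the sum of the small ones, because the compressor sees the repetition across all rows instead of restarting for every file.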
Compaction fixes this by periodically merging those small files into large, well-organized ones. The result is fewer files to open, better compression, and queries that run in seconds instead of minutes. It is one of the least glamorous and most important parts of running an analytical store at scale.
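As a rough illustration of the merging step, a compactor first has to decide which files to combine. The following is a minimal sketch of one possible planning strategy (greedy grouping up to a target size), not a description of how any particular system does it. DataFile, plan_compaction, target_bytes, and min_files are hypothetical names, and the sketch assumes that sorting by path approximates ingestion order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    path: str
    size_bytes: int


def plan_compaction(files, target_bytes, min_files=2):
    """Group small files into batches whose combined size stays at or under target_bytes."""
    # Files already at or above the target are left alone.
    # Assumption: path order approximates ingestion order.
    small = sorted((f for f in files if f.size_bytes < target_bytes), key=lambda f: f.path)
    batches, current, current_size = [], [], 0
    for f in small:
        # Close the current batch when adding this file would exceed the target.
        if current and current_size + f.size_bytes > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(f)
        current_size += f.size_bytes
    if current:
        batches.append(current)
    # Rewriting a batch that holds a single file gains nothing, so drop those.
    return [b for b in batches if len(b) >= min_files]


files = [DataFile(f"part-{i:04d}.parquet", 4_000_000) for i in range(10)]
files.append(DataFile("part-big.parquet", 600_000_000))
plan = plan_compaction(files, target_bytes=16_000_000)
for batch in plan:
    print(len(batch), "files ->", sum(f.size_bytes for f in batch), "bytes")
```

Each batch would then be read, merged, and rewritten as one larger file, with the originals removed only after the new file is safely in place. A real compactor also has to coordinate with concurrent writers and readers, which this sketch ignores.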
If you build your own pipeline on raw files, compaction is one of the first things you have to solve, and it is harder than it looks to do well.
How Arc handles compaction
Arc compacts automatically. It merges small ingested files into large ZSTD-compressed Parquet blocks in the background, so you get fast queries and roughly 90 percent storage reduction without managing the process yourself. You would have to build this by hand if you pointed a query engine at raw Parquet.
Arc is a high-performance columnar database. Open Parquet on storage you own, single Go binary, production-ready in 30 seconds.