Data Ingestion

Data ingestion is the process of bringing data from its sources into a system where it can be stored, processed, and analyzed. It is the first step of any data pipeline and can happen in batches or as a continuous stream.

Batch and streaming ingestion

Ingestion covers everything involved in getting data in: connecting to sources, accepting the data, validating it, and writing it to storage. It sounds simple, but at scale it is where many systems struggle, especially when data arrives fast and continuously.
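The stages named above (accept, validate, write) can be sketched as a small pipeline. This is a minimal illustration, not any particular system's implementation: the required field names, the JSON-lines input, and the in-memory list standing in for storage are all assumptions made for the example.

```python
import json
from typing import Iterable, Iterator, List

# Hypothetical schema: every record must carry these keys.
REQUIRED_FIELDS = {"sensor", "value", "ts"}


def accept(lines: Iterable[str]) -> Iterator[dict]:
    """Parse raw JSON lines, skipping anything that is not a JSON object."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # malformed input is dropped, not fatal
        if isinstance(record, dict):
            yield record


def is_valid(record: dict) -> bool:
    """Require all fields to be present and 'value' to be a real number."""
    if not REQUIRED_FIELDS.issubset(record):
        return False
    value = record["value"]
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def ingest(lines: Iterable[str], storage: List[dict]) -> int:
    """Run accept -> validate -> write and return the number of records written."""
    written = 0
    for record in accept(lines):
        if is_valid(record):
            storage.append(record)  # stand-in for a real storage write
            written += 1
    return written
```

Calling `ingest(raw_lines, storage)` on a mix of good and bad lines writes only the records that parse and pass validation. A real ingestion path would typically also report or quarantine rejected input rather than silently dropping it.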

There are two broad modes. Batch ingestion loads data in scheduled chunks, which is simple but adds latency. Streaming ingestion accepts data continuously as it arrives, which enables real-time analytics but demands a write path that can keep up with the incoming rate.
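The difference between the two modes shows up in how often the write path is invoked. The sketch below is illustrative only: the function names are hypothetical, `write` is any callable that persists a list of records, and a size-based flush stands in for the schedule a real batch job would run on.

```python
from typing import Callable, Iterable, List


def ingest_streaming(records: Iterable, write: Callable[[List], None]) -> int:
    """Write each record as soon as it arrives; return the number of write calls."""
    writes = 0
    for record in records:
        write([record])
        writes += 1
    return writes


def ingest_batched(records: Iterable, write: Callable[[List], None],
                   batch_size: int) -> int:
    """Buffer records and write them in chunks; return the number of write calls."""
    writes = 0
    buffer: List = []
    for record in records:
        buffer.append(record)
        if len(buffer) >= batch_size:
            write(buffer)
            writes += 1
            buffer = []  # start a fresh buffer after each flush
    if buffer:  # flush the final partial batch
        write(buffer)
        writes += 1
    return writes
```

With ten records and a batch size of four, the batched version makes three write calls while the streaming version makes ten. Batching amortizes per-write overhead at the cost of latency, since a record waits in the buffer until its batch is flushed.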

The more demanding the ingestion requirements (high volume, low latency), the more the design of the write path matters.

How Arc handles data ingestion

Arc is built around a high-throughput ingestion path, sustaining 19.9 million records per second on a single instance and making data queryable in about 100 milliseconds. It accepts multiple formats including line protocol, JSON, and Arrow.

Arc is a high-performance columnar database. It stores data as open Parquet files on storage you own, ships as a single Go binary, and is production-ready in 30 seconds.