Stop Paying Snowflake for Event Analytics

Your Snowflake bill is climbing. You already know why, but everyone keeps pretending it's a query optimization problem.

It isn't. It's an architecture problem. You're paying a cloud data warehouse to do a job it wasn't designed for: continuously ingesting and querying high-volume event data. Customer-facing dashboards. Product analytics. Application telemetry. Logs that got routed to the warehouse "because we already have Snowflake."

The pattern is the same everywhere I look. A team picks Snowflake for BI and exec reporting, where it's genuinely excellent. Then someone adds product events. Then someone adds application logs. Then someone adds a customer-facing dashboard that polls every 30 seconds. Six months later, the FinOps team is on a Zoom call asking why the warehouse credits doubled.

This post is about a second architecture, one built for the event data your warehouse was never meant to handle. I'll walk through what Snowflake actually is, why event data breaks its cost model, and how to build a Snowflake alternative for analytical workloads that have a timestamp on every row.

TL;DR

Snowflake bills compute per second with a 60-second minimum every time a warehouse resumes. Event data drives continuous query traffic, so warehouses never idle, cold starts force you to keep them warm, and concurrency scales by spinning up more clusters that each bill independently. The result is an order-of-magnitude cost increase versus the BI workload Snowflake was actually designed for.

What Snowflake actually is

Snowflake is a cloud data warehouse. Decoupled storage and compute. Data lives in Snowflake's proprietary micro-partition format on cloud object storage. Compute runs on "virtual warehouses," which are clusters you spin up on demand, billed per second with a 60-second minimum every time they resume.
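To make that billing rule concrete, here is a minimal sketch of per-second billing with a 60-second minimum on each resume. The credits-per-hour table follows Snowflake's standard warehouse sizing (X-Small is 1 credit per hour, doubling with each size). The function names are illustrative, and the price per credit is left out because it depends on your edition and contract.

```python
# Sketch of Snowflake-style compute billing: per-second, with a
# 60-second minimum charged each time a warehouse resumes.
# Names are illustrative; this is not an official pricing calculator.

CREDITS_PER_HOUR = {"XS": 1, "S": 2, "M": 4, "L": 8}  # doubles with each size
MIN_BILLED_SECONDS = 60


def billed_seconds(active_seconds: float) -> float:
    """Seconds billed for one resume-to-suspend period."""
    return max(MIN_BILLED_SECONDS, active_seconds)


def credits_for_period(size: str, active_seconds: float) -> float:
    """Credits consumed by one resume-to-suspend period of a single cluster."""
    return CREDITS_PER_HOUR[size] * billed_seconds(active_seconds) / 3600


if __name__ == "__main__":
    # A 5-second query on a freshly resumed warehouse still bills a full minute.
    print(billed_seconds(5))    # 60
    # A ten-minute batch job bills exactly what it used.
    print(billed_seconds(600))  # 600
    print(credits_for_period("M", 600))
```

The minimum barely matters for a ten-minute batch job. It matters a lot when short queries keep waking a warehouse up.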

This architecture is excellent for the workload it was designed for: batch analytics over structured business data. Run an exec dashboard at 8 AM. Spin up a warehouse for ten minutes. Spin it down. Bill is predictable. Run the quarterly board deck. Same pattern. Marketing attribution model. Same pattern. dbt transformations on overnight ETL'd data. Same pattern.

This is Snowflake's home turf and it's genuinely great at it. None of the rest of this post argues otherwise.

What breaks when you put event data in Snowflake

Event data has a different shape. High write volume, sometimes thousands or millions of rows per second. Continuous ingestion, not batched. Time-windowed reads, like "last 24 hours of user events" or "the last 5 minutes of error logs." Often customer-facing, like embedded dashboards that need to load in under a second on every page view.

Why does this workload look so different from BI? The short answer is that event data is what columnar analytical databases were actually designed for. Snowflake's micro-partition format is columnar too, but its billing and concurrency model assumes BI patterns, not continuous traffic.

Three things break:

Cost compounds non-linearly. Snowflake's per-second compute billing assumes warehouses spend most of their time idle. With continuous query traffic, the warehouse never sleeps. You're paying for uptime, not for work done. A customer-facing dashboard polling on every page view doesn't just cost more than the same dashboard run twice a day. Two ten-minute runs bill about 20 minutes of compute a day; a warehouse that never suspends bills 1,440. That is well past an order of magnitude.

Cold starts kill UX. Suspended warehouses take 1 to 3 seconds to resume. Fine for an internal BI tool. Brutal for a "Your Account" page where a user is waiting to see their data. The workaround is keeping warehouses warm, which means paying for idle compute. Either way, you pay.

Concurrency scales by adding more clusters. Multi-cluster warehouses are how Snowflake handles concurrent dashboard traffic, and each cluster runs and bills independently. Real-world customer-facing dashboards with a few thousand concurrent users can run 4 to 8 clusters in peak hours. The bill compounds fast.
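A rough way to put numbers on these three effects is to count billed cluster-hours per day. The sketch below assumes a Medium warehouse (4 credits per hour under Snowflake's standard sizing). Every workload figure in it is an illustrative assumption rather than a measurement, including the choice of 6 clusters as a midpoint of the 4-to-8 range.

```python
# Back-of-the-envelope daily credit comparison for one warehouse size.
# All workload numbers below are illustrative assumptions, not measurements.

CREDITS_PER_HOUR_MEDIUM = 4  # standard rate for a Medium warehouse


def daily_credits(cluster_hours: float,
                  credits_per_hour: float = CREDITS_PER_HOUR_MEDIUM) -> float:
    """Credits for a day, given total cluster-hours of billed uptime."""
    return cluster_hours * credits_per_hour


# BI pattern: two ten-minute runs a day, suspended otherwise.
batch_hours = 2 * (10 / 60)

# Event pattern: one cluster kept warm around the clock.
always_on_hours = 24.0

# Customer-facing peak: 6 clusters for 4 peak hours, 1 cluster for the other 20.
multi_cluster_hours = 6 * 4 + 1 * 20

if __name__ == "__main__":
    for label, hours in [
        ("batch BI", batch_hours),
        ("always-on", always_on_hours),
        ("multi-cluster peak", multi_cluster_hours),
    ]:
        print(f"{label:>20}: {daily_credits(hours):7.2f} credits/day")
    print(f"always-on vs batch: about {always_on_hours / batch_hours:.0f}x")
```

Multiply by your own price per credit to get dollars. The ratio between the rows is the point, not the absolute figures.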

If any of this sounds familiar, you're not querying inefficiently. You're using the wrong tool for the job.

The replacement: Arc

Arc is an analytical database designed for exactly this workload. A single Go binary. Native to S3, Azure Blob, MinIO, or local filesystem. Sustained 19.9M records/sec ingestion. Sub-second SQL queries on billions of rows. Open Apache Parquet on storage you already own.

Two things matter about that last point.

First: your data lives in your S3 bucket as standard Parquet files. Not in a proprietary format. You can query it with Arc, with DuckDB, with Spark, with ClickHouse, with Snowflake itself. Export costs nothing. There is no migration tax to leave.

Second: Arc doesn't bill you for warehouse uptime. You run it on infrastructure you control. The cost model is your compute and your storage, the same line items you already pay for everything else.

These two together are the difference. You stop renting access to your own data, and you stop paying a per-second meter to keep dashboards responsive.

The architecture you build

Three layers.

Ingest. Arc accepts data from anything that emits timestamped records. Telegraf, with its broad ecosystem of source plugins, covers most operational telemetry: system metrics, application logs, message queue stats, cloud provider APIs. The Python SDK using MessagePack columnar format covers application events from your own code at 19.9M records/sec sustained. Line Protocol works as a drop-in replacement for anything currently pointed at InfluxDB. Redpanda Connect and Kafka pipe stream-processed events into Arc directly. HTTP works for everything else.

You don't need to standardize on one ingestion path. Most production deployments use two or three in parallel.
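For a feel of what the Line Protocol path carries on the wire, here is a small sketch that formats one event as an InfluxDB Line Protocol record. The measurement, tag, and field names are made up for illustration. The write endpoint is deliberately not shown, since its path depends on your Arc deployment and version.

```python
# Sketch: format one event as an InfluxDB Line Protocol record.
# Layout: measurement,tag=value field=value timestamp_ns
# The event names and values used with it are hypothetical.


def _escape(text: str, chars: str) -> str:
    """Backslash-escape the characters Line Protocol treats as delimiters."""
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text


def _format_field(value) -> str:
    if isinstance(value, bool):   # check bool first: bool is an int subclass
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"        # integers carry an 'i' suffix
    if isinstance(value, float):
        return repr(value)
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'         # strings are double-quoted


def to_line_protocol(measurement: str, tags: dict, fields: dict, ts_ns: int) -> str:
    if not fields:
        raise ValueError("Line Protocol requires at least one field")
    head = _escape(measurement, ", ")
    for key in sorted(tags):      # sorted tag keys are the recommended order
        head += f",{_escape(key, ',= ')}={_escape(str(tags[key]), ',= ')}"
    body = ",".join(
        f"{_escape(name, ',= ')}={_format_field(value)}"
        for name, value in fields.items()
    )
    return f"{head} {body} {ts_ns}"


if __name__ == "__main__":
    print(to_line_protocol(
        "page_view",
        {"region": "us-east", "host": "web-1"},
        {"latency_ms": 12.5, "status": 200, "cached": False},
        1700000000000000000,
    ))
```

In practice Telegraf or a client library builds these records for you; the sketch is only meant to show how little structure an event needs.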

Store. Arc writes Parquet files directly to your object storage backend. Background compaction merges small files into larger optimized ones, with typical 3-5x compression on event data and ZSTD compression for cold data. Multi-database architecture lets you separate environments, tenants, or applications cleanly. Everything is queryable as Parquet, so the data is portable from day one.

Query. SQL via Arc's HTTP API, with full DuckDB function compatibility. Grafana speaks to Arc directly for dashboards. Superset works for BI-style analysis. Any application that issues SQL queries can hit Arc as a backend. And because the underlying storage is Parquet, you can also query the same files from any other Parquet-aware tool without going through Arc at all.
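As a sketch of what "any application that issues SQL" might look like, the snippet below sends a time-windowed query over HTTP using only the Python standard library. Treat the specifics as assumptions to check against the docs for your Arc version: the `/api/v1/query` path, the `{"sql": ...}` JSON body, and bearer-token auth. The `user_events` table and its `time` column are hypothetical.

```python
import json
import urllib.request

# ASSUMPTIONS (verify against your Arc version's docs): the endpoint path,
# the {"sql": ...} request body, and bearer-token authentication.
QUERY_PATH = "/api/v1/query"

# Hypothetical table and column names; the window filter is DuckDB-style SQL.
LAST_24H_SQL = """
SELECT date_trunc('hour', time) AS bucket, count(*) AS events
FROM user_events
WHERE time >= now() - INTERVAL 24 HOUR
GROUP BY bucket
ORDER BY bucket
"""


def build_query_request(base_url: str, token: str, sql: str) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying one SQL statement."""
    body = json.dumps({"sql": sql}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + QUERY_PATH,
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def run_query(base_url: str, token: str, sql: str):
    """Send the query and decode the JSON response (shape depends on the server)."""
    request = build_query_request(base_url, token, sql)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

Calling `run_query("http://localhost:8000", token, LAST_24H_SQL)` would target the container from the self-hosted setup, assuming port 8000 is where the API listens.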

This is a Snowflake alternative for one specific kind of workload: event data with timestamps and high write volume. It's not a warehouse, and it's not pretending to be.

How to start

Three paths, depending on where you are.

Self-hosted, free.

docker run -d --name arc -p 8000:8000 \
  -e ARC_STORAGE_BACKEND=s3 \
  -e ARC_STORAGE_S3_BUCKET=my-arc-bucket \
  -e ARC_STORAGE_S3_REGION=us-east-1 \
  -e AWS_ACCESS_KEY_ID=... \
  -e AWS_SECRET_ACCESS_KEY=... \
  -v arc-data:/data/arc \
  ghcr.io/basekick-labs/arc:latest

Point Telegraf at it. Send your first event. AGPL-3.0, open source, no signup. Production-ready in minutes. If you'd rather try a SQL query on live demo datasets first, the Arc Playground is open and read-only.

Production scale.

Arc Enterprise unlocks the production features you need at scale: high availability, clustering, RBAC, audit logging, tiered storage, scheduled tasks, priority support. Same binary, same engine, same data format. A license key flips the features on.

Migrating data already in Snowflake.

Snowflake's COPY INTO exports tables to Parquet on S3. Point Arc at the bucket and query immediately, no format conversion, no ETL job. You can run Arc and Snowflake side by side, with the same data files, and decide what to keep where. Move the workloads that bleed money. Keep the warehouse for what it's actually good at.
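The export step might look like the sketch below, which builds a COPY INTO statement that unloads one table to Parquet on S3. The table name, bucket path, and storage integration name are placeholders. It assumes you already have a Snowflake storage integration with write access to the bucket. HEADER = TRUE is included so the Parquet files keep the original column names.

```python
# Sketch: build a Snowflake COPY INTO statement that unloads a table to
# Parquet files on S3. The table, bucket path, and storage integration
# names passed in below are placeholders.


def build_unload_sql(table: str, s3_url: str, storage_integration: str) -> str:
    if not s3_url.startswith("s3://"):
        raise ValueError("expected an s3:// URL")
    if "'" in s3_url:
        raise ValueError("s3_url must not contain single quotes")
    return (
        f"COPY INTO '{s3_url}'\n"
        f"  FROM {table}\n"
        f"  STORAGE_INTEGRATION = {storage_integration}\n"
        f"  FILE_FORMAT = (TYPE = PARQUET)\n"
        f"  HEADER = TRUE;"
    )


if __name__ == "__main__":
    print(build_unload_sql(
        "analytics.public.user_events",
        "s3://my-arc-bucket/snowflake-export/user_events/",
        "my_s3_integration",
    ))
```

Run the printed statement in a Snowflake worksheet or through your usual client. Start with one table from the workload that costs the most, so you can compare results side by side before moving anything else.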

Why this matters

Your event data is your most expensive line item. It's also the data you have the least control over when it lives in a proprietary warehouse format. Both of those things are fixable.

Run Arc for the event analytics workload. Run Snowflake for the warehouse workload. Stop paying one tool to do a job it wasn't built for. Stop letting your data live in a format you can't read without renting access to it.

The point isn't that Snowflake is bad. The point is that it's expensive when used wrong, and most teams are using it wrong on the highest-volume part of their data.

The actually-useful CTA

Arc is on GitHub: github.com/Basekick-Labs/arc. Open source under AGPL-3.0.

If your Snowflake bill keeps growing and you want to talk about it, DM me on LinkedIn or email ignacio@basekick.net. Even if all you want is to compare notes on the cost math.
