13 Years of NYC Citibike Data. 153 Million Trips. One Arc Instance.

Most time-series databases assume your old data doesn't matter. Retention policies delete anything older than 30 or 90 days. Storage costs spiral out of control. And when you actually need to query historical data? Good luck — you're either restoring from backups or mounting old volumes one at a time while your production database sits offline.
But for a lot of industries, old data isn't optional. It's required.
ITAR (International Traffic in Arms Regulations) requires defense contractors and aerospace companies to retain telemetry and manufacturing data for years — sometimes decades. FDA 21 CFR Part 11 mandates that pharmaceutical and medical device companies keep complete audit trails of equipment and process data. SOX (Sarbanes-Oxley) requires financial institutions to preserve records for 7+ years. NERC CIP demands energy companies retain operational data from grid infrastructure.
These aren't edge cases. If you're in aerospace, pharma, medical devices, energy, or finance — keeping years of time-series data isn't a "nice to have." It's the law.
The problem is that most time-series databases weren't built for this. They were built for dashboards showing the last 24 hours. Once you need years of history, you're fighting storage costs, degraded query performance, and operational nightmares.
That's the problem we built Arc to solve.
Compaction + S3 = Affordable Long-Term Retention
Arc stores data in Parquet files — a columnar format that compresses time-series data 3-5x compared to row-based storage. But compression alone isn't enough when you're talking about years of data.
Arc's compaction process continuously merges and optimizes Parquet files. Small files get consolidated. Data gets re-sorted for better compression ratios. The result is a storage footprint that stays lean even as your dataset grows over years.
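To make the idea concrete, here's a minimal sketch of what a compaction planner does. This is not Arc's actual implementation — the `ParquetFile` shape, the 512 MB target, and `plan_compaction` are all illustrative — but it shows the core move: group small, time-adjacent files into batches worth merging.

```python
from dataclasses import dataclass

# Hypothetical file descriptor; Arc's real compaction internals differ.
@dataclass
class ParquetFile:
    path: str
    size_mb: int
    min_ts: int  # partition time bounds (epoch seconds)
    max_ts: int

TARGET_MB = 512  # illustrative target size for a compacted file

def plan_compaction(files):
    """Group small, time-adjacent files into merge batches up to TARGET_MB."""
    batches, current, current_size = [], [], 0
    for f in sorted(files, key=lambda f: f.min_ts):
        if current and current_size + f.size_mb > TARGET_MB:
            batches.append(current)
            current, current_size = [], 0
        current.append(f)
        current_size += f.size_mb
    if current:
        batches.append(current)
    # Only batches with more than one file actually need a merge pass.
    return [b for b in batches if len(b) > 1]

# Twenty small 40 MB files, one per hour of ingestion.
small = [ParquetFile(f"trips-{i}.parquet", 40, i * 3600, (i + 1) * 3600)
         for i in range(20)]
print(len(plan_compaction(small)))  # -> 2 merge batches
```

Merging keeps adjacent timestamps together, which is also why re-sorted, consolidated files compress better than the small ones they replace.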
Then there's the storage backend. Arc supports S3-compatible object storage (S3, MinIO, R2, GCS) as a first-class backend. Object storage is dramatically cheaper than block storage — we're talking $0.023/GB/month for S3 Standard vs. $0.10/GB/month for EBS. When you're storing years of sensor data, that difference adds up fast.
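A quick back-of-envelope calculation shows what "adds up fast" means. The per-GB prices are the ones quoted above; the 5 TB dataset size is just an example.

```python
# Per-GB monthly prices quoted above ($/GB/month).
S3_STANDARD = 0.023
EBS = 0.10

def monthly_cost(gb, price_per_gb):
    return gb * price_per_gb

dataset_gb = 5_000  # e.g. 5 TB of historical sensor data (illustrative)
s3 = monthly_cost(dataset_gb, S3_STANDARD)
ebs = monthly_cost(dataset_gb, EBS)
print(f"S3: ${s3:,.0f}/mo  EBS: ${ebs:,.0f}/mo  savings: ${ebs - s3:,.0f}/mo")
# S3: $115/mo  EBS: $500/mo  savings: $385/mo
```

At 5 TB that's over 4x cheaper per month, and the gap only widens as retention requirements push you toward tens of terabytes.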
For Arc Enterprise, we're building tiered storage. Recent data — say the last 90 days — lives on fast local or SSD-backed storage for maximum query performance. Older data automatically moves to S3 or similar object storage. Same queries. Same SQL. Arc handles the routing transparently. You get hot performance on recent data and cold storage costs on everything else.
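The routing rule itself is simple to picture. Here's a sketch, assuming a 90-day hot window — the function name and tier labels are made up for illustration, not Arc Enterprise's API:

```python
from datetime import datetime, timedelta, timezone

HOT_WINDOW = timedelta(days=90)  # illustrative hot/cold boundary

def tier_for(partition_max_ts: datetime, now: datetime) -> str:
    """Route a partition: recent data stays on fast local storage,
    older partitions live on object storage. Arc Enterprise would do
    this transparently; this just illustrates the routing rule."""
    return "hot-ssd" if now - partition_max_ts <= HOT_WINDOW else "cold-s3"

now = datetime(2026, 2, 1, tzinfo=timezone.utc)
print(tier_for(datetime(2026, 1, 15, tzinfo=timezone.utc), now))  # hot-ssd
print(tier_for(datetime(2019, 7, 1, tzinfo=timezone.utc), now))   # cold-s3
```

Because the decision is made per partition rather than per row, queries that span the boundary simply read from both tiers — the SQL doesn't change.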
No more mounting volumes. No more stopping databases. No more choosing between query speed and storage bills.
153 Million Trips to Prove It
To show what this looks like in practice, we loaded every single NYC Citibike trip ever recorded into Arc. Every ride from June 2013 through January 2026. 153 million trips across 13 years of data. One Arc instance.
What You Get
An interactive map of New York City with 13 years of bike trip data at your fingertips.
Pick any year from 2013 to 2026. Pick any month. Filter by rider type — members or casual riders. Filter by bike type — electric or classic (available from 2020 onward). The map shows trip start points (green) and end points (red). Click any trip to see its full route, duration, start and end stations, and rider details.
There's a monthly trend chart at the bottom showing ride volume across the year. Click a month to jump to it. A stats panel breaks down total rides, member vs. casual split, bike type distribution, and the top stations for whatever period you're looking at.
And every single query shows its execution time. No hiding. You can see exactly how long Arc takes to answer.
13 Years of Open Data
The dataset covers June 2013 through January 2026. That's the entire history of NYC's Citibike program — from the first rides in lower Manhattan to today's sprawling network across all five boroughs.
Fun fact: the data has two completely different schemas. The 2013-2019 files use one format (with fields like usertype, birth year, and gender). The 2020+ files switched to a new format (with member_casual, rideable_type, and station IDs). Arc handles both transparently. Same table, same queries.
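Handling both schemas comes down to mapping each row shape onto one common set of columns. Here's a rough sketch — the field names come from the public CSVs, but the `normalize` function and the legacy mapping (`usertype` "Subscriber" becoming "member") are our illustration, not Arc's internal code:

```python
def normalize(row: dict) -> dict:
    """Map either Citibike CSV schema onto one common shape.
    Illustrative only; Arc's actual schema handling differs."""
    if "member_casual" in row:  # 2020+ schema
        return {
            "rider": row["member_casual"],
            "bike": row.get("rideable_type", "classic_bike"),
        }
    # 2013-2019 schema: subscribers map to members, everyone else is casual.
    rider = "member" if row.get("usertype") == "Subscriber" else "casual"
    return {"rider": rider, "bike": "classic_bike"}

legacy = {"usertype": "Subscriber", "birth year": "1985", "gender": "1"}
modern = {"member_casual": "casual", "rideable_type": "electric_bike"}
print(normalize(legacy))  # {'rider': 'member', 'bike': 'classic_bike'}
print(normalize(modern))  # {'rider': 'casual', 'bike': 'electric_bike'}
```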
We'll update this dataset every month as NYC releases new ride data. It's public data, published monthly by Lyft and NYC DOT. When new months drop, we ingest them and the demo just gets bigger.
Why Picking Any Month Feels Instant
This is the part I'm most excited to talk about.
You've got 153 million rows. You select "March 2024" from the dropdown. The map updates in milliseconds. How?
Time-based pruning.
Arc stores data in Parquet files partitioned by time. When you query for a specific month, Arc's query planner looks at the partition metadata and skips every file that doesn't contain data for that time range. It never scans all 153 million rows. It never even opens the files it doesn't need.
So when you pick March 2024, Arc reads maybe a few Parquet files covering that month — not the thousands of files spanning 13 years. That's why the query time you see in the demo is typically under 100ms, regardless of which month or year you select. The dataset could be a billion rows and the experience would be the same.
This is the whole point of time-series partitioning. And it's automatic in Arc — you don't configure it. You don't create indexes. You just write data with timestamps and Arc organizes it for you.
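The pruning step described above can be sketched in a few lines. The partition list and `prune` function are hypothetical stand-ins for Arc's query planner, but the overlap check is the essence of it:

```python
from datetime import datetime

# Hypothetical partition metadata: (path, min timestamp, max timestamp).
PARTITIONS = [
    ("trips/2024-02.parquet", datetime(2024, 2, 1), datetime(2024, 2, 29)),
    ("trips/2024-03.parquet", datetime(2024, 3, 1), datetime(2024, 3, 31)),
    ("trips/2024-04.parquet", datetime(2024, 4, 1), datetime(2024, 4, 30)),
]

def prune(partitions, start, end):
    """Keep only files whose [min, max] range overlaps the query window.
    Everything else is skipped without ever being opened."""
    return [path for (path, lo, hi) in partitions if lo <= end and hi >= start]

# "March 2024" touches exactly one file out of the whole set.
print(prune(PARTITIONS, datetime(2024, 3, 1), datetime(2024, 3, 31)))
# -> ['trips/2024-03.parquet']
```

Scale that list from three files to thousands spanning 13 years and the behavior is the same: the query cost tracks the size of the time window, not the size of the dataset.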
Coming Soon: Built-In CSV Import
Here's a teaser. Loading 153 million rows from CSV files into Arc today requires custom ingestion scripts: downloading the files from NYC's servers, parsing the two different schemas, converting everything to MessagePack columnar format, and streaming it all into Arc.
Starting with Arc 26.03.1, we're shipping built-in CSV import. Point Arc at a directory of CSV files, and it handles the rest. Schema detection, type inference, batch loading, progress tracking.
For datasets like Citibike — where the source data is published as CSV — this will make ingestion trivial. No scripts. No SDK. Just Arc and your files.
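To give a flavor of what type inference involves, here's a deliberately naive sketch. The function and the type names are our illustration — Arc's actual importer may infer types very differently:

```python
def infer_type(values):
    """Naive column type inference: try int, then float, else string.
    A sketch of the idea, not Arc's CSV importer."""
    def fits(cast):
        try:
            for v in values:
                cast(v)
            return True
        except ValueError:
            return False
    if fits(int):
        return "int64"
    if fits(float):
        return "float64"
    return "string"

print(infer_type(["1", "2", "3"]))       # int64
print(infer_type(["1.5", "2"]))          # float64
print(infer_type(["member", "casual"]))  # string
```

A real importer samples rows rather than scanning every value, and has to handle nulls, dates, and mixed columns — but the fallback chain from narrow to wide types is the same basic shape.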
Try It
Pick a year. Pick a month. Watch Arc query 153 million rows in milliseconds. Click around. Compare summer vs. winter ridership. See how electric bikes took over after 2020. Find the busiest stations.
Then think about your own data. If you're sitting on years of sensor telemetry, equipment logs, or flight data — and your current database makes you choose between keeping it all and keeping it fast — Arc was built for you.
Try Arc: https://github.com/Basekick-Labs/arc
All demos: basekick.net/demos
Ready to handle billion-record workloads?
Deploy Arc in minutes. Own your data in Parquet. Use it for analytics, observability, AI, IoT, or data warehousing.
