The $100K LLM Bill: Tracking AI Agent Costs with Arc


Last month, I got a call from a friend running a startup. "Nacho, I need help. Our December AI bill was $127,000."

I almost dropped my coffee. "Wasn't it like $12K in November?"

"Yeah. Our CFO is losing his mind."

The culprit? Their customer support agent had a feature that let users ask follow-up questions. When it went viral on Product Hunt, each conversation spawned 15-20 API calls instead of the expected 3-4. No one noticed until the invoice arrived.

This is the story of how we built a cost monitoring system that would have caught this in hours, not weeks.

The Hidden Cost Explosion Problem

AI agents are different from traditional software. They make decisions autonomously, and those decisions have direct cost implications.

Consider what can go wrong:

Runaway loops. An agent decides it needs "more context" and keeps calling the API until it hits a token limit. I've seen a single user session burn $50 in minutes.

Verbose prompts. Someone adds "be thorough and detailed" to a system prompt. Suddenly every response is 4x longer. That's 4x the output tokens, 4x the cost.

Model upgrades without testing. "Let's try GPT-4 instead of GPT-4o-mini for better quality." Cool, that's a 66x price increase on output tokens.

Retry storms. Rate limits trigger retries. Each retry costs money. Under load, you can spend more on failed requests than successful ones.

Traditional APM tools track latency, errors, and throughput. They don't track token economics. You find out about cost problems when the invoice arrives—usually 30 days too late.

Why LLM Telemetry Is Time-Series Data

Every LLM API call generates the same type of data:

  • Timestamp: When did this call happen?
  • Tokens: How many input/output tokens?
  • Cost: What did this specific call cost?
  • Model: Which model was used?
  • Metadata: Which agent, user, session?

This is textbook time-series data. High volume, append-only, time-ordered. And it has high cardinality—thousands of agent sessions across multiple models, users, and API keys.

Arc handles exactly this workload. Let's build the monitoring system.

Setting Up the Infrastructure

We'll use Arc for storage and Grafana for visualization. Here's the Docker Compose setup:

version: '3.8'
 
services:
  arc:
    image: ghcr.io/basekick-labs/arc:25.12.1
    container_name: arc
    ports:
      - "8000:8000"
    volumes:
      - arc-data:/app/data
    environment:
      - STORAGE_BACKEND=local
    restart: unless-stopped
 
  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_PLUGINS_ALLOW_LOADING_UNSIGNED_PLUGINS=basekick-arc-datasource
    volumes:
      - grafana-data:/var/lib/grafana
    depends_on:
      - arc
    restart: unless-stopped
 
volumes:
  arc-data:
  grafana-data:

Start it up:

docker compose up -d

Grab your Arc admin token from the logs:

docker logs arc | grep "Initial admin API token"

Save this token—you'll need it for API calls and Grafana configuration.

The Data Collection Script

We collect two types of data from OpenAI and store them in separate tables:

  • Usage API → llm_usage table: hourly token counts per model
  • Costs API → llm_costs table: daily costs per line item

This keeps the raw data intact—no aggregation in the collector. All analysis happens in SQL at query time.

Important: You need an Admin API key with the api.usage.read scope. Create one at platform.openai.com/settings/organization/admin-keys.
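
Before wiring up the collector, you can sanity-check the admin key against the same endpoint the script uses (GNU date shown; on macOS use date -v-1d +%s):

curl -s "https://api.openai.com/v1/organization/usage/completions?start_time=$(date -d '1 day ago' +%s)&bucket_width=1h" \
  -H "Authorization: Bearer $OPENAI_ADMIN_KEY"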

Here's the Python script:

#!/usr/bin/env python3
"""
LLM Cost Collector for Arc - Polls OpenAI Usage + Costs APIs and stores data in Arc.
Run hourly via cron: 0 * * * * /path/to/llm_cost_collector.py
 
Stores two separate tables:
- llm_usage: hourly token counts from Usage API
- llm_costs: daily costs from Costs API
"""
 
import os
import requests
import msgpack
from datetime import datetime, timezone, timedelta
 
ARC_URL = os.environ.get("ARC_URL", "http://localhost:8000")
ARC_TOKEN = os.environ.get("ARC_TOKEN")
OPENAI_ADMIN_KEY = os.environ.get("OPENAI_ADMIN_KEY")
 
def fetch_openai_usage(start_time: datetime, end_time: datetime) -> list:
    """Fetch token usage from OpenAI Usage API."""
    if not OPENAI_ADMIN_KEY:
        return []
 
    response = requests.get(
        "https://api.openai.com/v1/organization/usage/completions",
        headers={"Authorization": f"Bearer {OPENAI_ADMIN_KEY}"},
        params={
            "start_time": int(start_time.timestamp()),
            "end_time": int(end_time.timestamp()),
            "bucket_width": "1h",
            "group_by": ["model"]
        }
    )
    response.raise_for_status()
    data = response.json()
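    # NOTE: this sketch assumes one page of buckets; for longer windows,
    # follow the API's has_more / next_page fields to paginate.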
 
    records = []
    for bucket in data.get("data", []):
        ts = bucket["start_time"] * 1000
        for result in bucket.get("results", []):
            records.append({
                "time": ts,
                "provider": "openai",
                "model": result.get("model") or "unknown",
                "input_tokens": result.get("input_tokens", 0),
                "output_tokens": result.get("output_tokens", 0),
                "num_requests": result.get("num_model_requests", 0),
            })
    return records
 
def fetch_openai_costs(start_time: datetime, end_time: datetime) -> list:
    """Fetch costs from OpenAI Costs API."""
    if not OPENAI_ADMIN_KEY:
        return []
 
    day_start = start_time.replace(hour=0, minute=0, second=0, microsecond=0)
    day_end = (end_time + timedelta(days=1)).replace(hour=0, minute=0, second=0, microsecond=0)
 
    response = requests.get(
        "https://api.openai.com/v1/organization/costs",
        headers={"Authorization": f"Bearer {OPENAI_ADMIN_KEY}"},
        params={
            "start_time": int(day_start.timestamp()),
            "end_time": int(day_end.timestamp()),
            "bucket_width": "1d",
            "group_by": ["line_item"]
        }
    )
    response.raise_for_status()
    data = response.json()
 
    records = []
    for bucket in data.get("data", []):
        ts = bucket["start_time"] * 1000
        for result in bucket.get("results", []):
            line_item = result.get("line_item") or "unknown"
            amount = float(result.get("amount", {}).get("value", 0.0))
            if amount > 0:
                records.append({
                    "time": ts,
                    "provider": "openai",
                    "line_item": line_item,
                    "cost_usd": amount,
                })
    return records
 
def send_usage_to_arc(records: list):
    """Send usage records to Arc."""
    if not records:
        return
    data = {
        "m": "llm_usage",
        "columns": {
            "time": [r["time"] for r in records],
            "provider": [r["provider"] for r in records],
            "model": [r["model"] for r in records],
            "input_tokens": [r["input_tokens"] for r in records],
            "output_tokens": [r["output_tokens"] for r in records],
            "num_requests": [r["num_requests"] for r in records],
        }
    }
    response = requests.post(
        f"{ARC_URL}/api/v1/write/msgpack",
        headers={
            "Authorization": f"Bearer {ARC_TOKEN}",
            "Content-Type": "application/msgpack",
            "x-arc-database": "llm_costs"
        },
        data=msgpack.packb(data)
    )
    print(f"Sent {len(records)} usage records to Arc (status: {response.status_code})")
 
def send_costs_to_arc(records: list):
    """Send cost records to Arc."""
    if not records:
        return
    data = {
        "m": "llm_costs",
        "columns": {
            "time": [r["time"] for r in records],
            "provider": [r["provider"] for r in records],
            "line_item": [r["line_item"] for r in records],
            "cost_usd": [r["cost_usd"] for r in records],
        }
    }
    response = requests.post(
        f"{ARC_URL}/api/v1/write/msgpack",
        headers={
            "Authorization": f"Bearer {ARC_TOKEN}",
            "Content-Type": "application/msgpack",
            "x-arc-database": "llm_costs"
        },
        data=msgpack.packb(data)
    )
    print(f"Sent {len(records)} cost records to Arc (status: {response.status_code})")
 
def main():
    end_time = datetime.now(timezone.utc)
    start_time = end_time - timedelta(hours=24)
 
    print(f"Fetching data from {start_time} to {end_time}")
 
    # Fetch and store usage (tokens)
    usage_records = fetch_openai_usage(start_time, end_time)
    print(f"Found {len(usage_records)} usage records")
    send_usage_to_arc(usage_records)
 
    # Fetch and store costs (USD)
    cost_records = fetch_openai_costs(start_time, end_time)
    print(f"Found {len(cost_records)} cost records")
    send_costs_to_arc(cost_records)
 
if __name__ == "__main__":
    main()

Save this as llm_cost_collector.py and set up your environment:

export ARC_TOKEN="your-arc-token"
export OPENAI_ADMIN_KEY="your-openai-admin-key"  # Admin key, not regular API key
 
pip install requests msgpack
python llm_cost_collector.py

Set up a cron job to run hourly:

# Edit crontab
crontab -e
 
# Add this line (adjust paths as needed). Cron doesn't inherit your shell
# environment, so define ARC_TOKEN and OPENAI_ADMIN_KEY in the crontab or
# in a wrapper script.
0 * * * * /usr/bin/python3 /path/to/llm_cost_collector.py >> /var/log/llm_costs.log 2>&1

Cost Tracking Queries

Now for the fun part—SQL queries that give you instant visibility into your LLM spending.

We have two tables:

  • llm_usage: hourly token counts (from Usage API)
  • llm_costs: daily costs in USD (from Costs API)

Note: Since the collector may run multiple times and re-fetch overlapping data, we use DISTINCT to deduplicate records. Let Grafana handle aggregation (Sum, Avg, etc.) in the panel settings.

Daily Burn Rate

How much are we spending per day?

SELECT time, line_item, cost_usd
FROM (SELECT DISTINCT time, line_item, cost_usd FROM llm_costs.llm_costs) d
WHERE time > NOW() - INTERVAL '30 days'
ORDER BY time DESC;

Configure the Grafana panel to use Sum transformation to aggregate by day.
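
If you'd rather aggregate in SQL than in the panel, a grouped variant of the same query works too:

SELECT DATE_TRUNC('day', time) as day, SUM(cost_usd) as daily_cost
FROM (SELECT DISTINCT time, line_item, cost_usd FROM llm_costs.llm_costs) d
WHERE time > NOW() - INTERVAL '30 days'
GROUP BY DATE_TRUNC('day', time)
ORDER BY day DESC;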

Token Usage by Model

Which models are consuming the most tokens?

SELECT time, model, input_tokens, output_tokens, num_requests
FROM (SELECT DISTINCT time, model, input_tokens, output_tokens, num_requests FROM llm_costs.llm_usage) d
WHERE time > NOW() - INTERVAL '7 days'
ORDER BY time DESC;
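
The Costs API only breaks spend down by line item, so if you want a rough per-model cost estimate, you can join usage against a hand-maintained price list. Here's a sketch with illustrative per-1M-token prices (verify them against OpenAI's current pricing page; the prefix match handles versioned model names like gpt-4o-mini-2024-07-18):

WITH pricing(model, input_per_1m, output_per_1m) AS (
    VALUES
        ('gpt-4o-mini', 0.15, 0.60),
        ('gpt-4o', 2.50, 10.00)
)
SELECT
    u.model,
    ROUND(SUM(u.input_tokens) / 1e6 * p.input_per_1m
        + SUM(u.output_tokens) / 1e6 * p.output_per_1m, 2) as est_cost_usd
FROM (SELECT DISTINCT time, model, input_tokens, output_tokens FROM llm_costs.llm_usage) u
JOIN pricing p ON u.model LIKE p.model || '%'
WHERE u.time > NOW() - INTERVAL '7 days'
GROUP BY u.model, p.input_per_1m, p.output_per_1m
ORDER BY est_cost_usd DESC;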

Cost by Line Item

OpenAI breaks down costs by model and direction (input/output):

SELECT time, line_item, cost_usd
FROM (SELECT DISTINCT time, line_item, cost_usd FROM llm_costs.llm_costs) d
WHERE time > NOW() - INTERVAL '7 days'
ORDER BY cost_usd DESC;
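
For a ranked table over the whole window instead of a time series:

SELECT line_item, SUM(cost_usd) as total_cost
FROM (SELECT DISTINCT time, line_item, cost_usd FROM llm_costs.llm_costs) d
WHERE time > NOW() - INTERVAL '7 days'
GROUP BY line_item
ORDER BY total_cost DESC
LIMIT 10;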

Week-over-Week Cost Growth

Is spending accelerating?

SELECT
    week,
    weekly_cost,
    LAG(weekly_cost) OVER (ORDER BY week) as prev_week_cost,
    ROUND(
        (weekly_cost - LAG(weekly_cost) OVER (ORDER BY week)) /
        NULLIF(LAG(weekly_cost) OVER (ORDER BY week), 0) * 100,
        1
    ) as growth_percent
FROM (
    SELECT
        DATE_TRUNC('week', time) as week,
        SUM(cost_usd) as weekly_cost
    FROM (SELECT DISTINCT time, line_item, cost_usd FROM llm_costs.llm_costs) d
    WHERE time > NOW() - INTERVAL '8 weeks'
    GROUP BY DATE_TRUNC('week', time)
) weekly_data
ORDER BY week DESC;

Hourly Token Spikes (Anomaly Detection)

Find hours where token usage spiked abnormally:

SELECT
    hour,
    hourly_tokens,
    avg_tokens,
    ROUND((hourly_tokens - avg_tokens) / NULLIF(stddev_tokens, 0), 2) as z_score
FROM (
    SELECT
        DATE_TRUNC('hour', time) as hour,
        SUM(input_tokens + output_tokens) as hourly_tokens,
        AVG(SUM(input_tokens + output_tokens)) OVER () as avg_tokens,
        STDDEV(SUM(input_tokens + output_tokens)) OVER () as stddev_tokens
    FROM (SELECT DISTINCT time, model, input_tokens, output_tokens FROM llm_costs.llm_usage) d
    WHERE time > NOW() - INTERVAL '7 days'
    GROUP BY DATE_TRUNC('hour', time)
) hourly_data
WHERE (hourly_tokens - avg_tokens) / NULLIF(stddev_tokens, 0) > 2
ORDER BY hour DESC;

A z-score above 2 means the hourly token usage was more than 2 standard deviations above average—worth investigating.
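
One caveat: a global average will flag every busy business hour if your traffic is strongly diurnal. A variant that compares each hour against the average for that same hour of day (over a longer window) cuts those false positives:

SELECT
    hour,
    hourly_tokens,
    ROUND((hourly_tokens - avg_tokens) / NULLIF(stddev_tokens, 0), 2) as z_score
FROM (
    SELECT
        DATE_TRUNC('hour', time) as hour,
        SUM(input_tokens + output_tokens) as hourly_tokens,
        AVG(SUM(input_tokens + output_tokens)) OVER (
            PARTITION BY EXTRACT(HOUR FROM DATE_TRUNC('hour', time))
        ) as avg_tokens,
        STDDEV(SUM(input_tokens + output_tokens)) OVER (
            PARTITION BY EXTRACT(HOUR FROM DATE_TRUNC('hour', time))
        ) as stddev_tokens
    FROM (SELECT DISTINCT time, model, input_tokens, output_tokens FROM llm_costs.llm_usage) d
    WHERE time > NOW() - INTERVAL '28 days'
    GROUP BY DATE_TRUNC('hour', time)
) hourly_data
WHERE (hourly_tokens - avg_tokens) / NULLIF(stddev_tokens, 0) > 2
ORDER BY hour DESC;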

Grafana Dashboard

Let's visualize this data. First, install the Arc datasource plugin:

# Download and install
wget https://github.com/basekick-labs/grafana-arc-datasource/releases/latest/download/basekick-arc-datasource-1.0.0.zip
unzip basekick-arc-datasource-1.0.0.zip
docker cp basekick-arc-datasource grafana:/var/lib/grafana/plugins/
docker restart grafana

Configure the Arc datasource in Grafana:

  1. Go to Configuration → Data Sources → Add data source
  2. Select Arc
  3. Set URL to http://arc:8000 (container name)
  4. Enter your API token
  5. Set database to llm_costs

Dashboard Panels

Here's what your dashboard should include:

1. Total Spend (Last 7 Days) - Stat Panel

SELECT cost_usd
FROM (SELECT DISTINCT time, line_item, cost_usd FROM llm_costs.llm_costs) d
WHERE time > NOW() - INTERVAL '7 days'

Set the stat panel calculation to Total to sum all values.

2. Daily Cost Trend - Time Series

SELECT time, cost_usd
FROM (SELECT DISTINCT time, line_item, cost_usd FROM llm_costs.llm_costs) d
WHERE $__timeFilter(time)
ORDER BY time ASC

[Screenshot: Daily cost trend showing spending over time]

3. Cost by Line Item - Bar Chart

SELECT line_item, cost_usd
FROM (SELECT DISTINCT time, line_item, cost_usd FROM llm_costs.llm_costs) d
WHERE $__timeFilter(time)
ORDER BY cost_usd DESC

4. Token Usage by Model - Bar Chart

SELECT model, input_tokens, output_tokens
FROM (SELECT DISTINCT time, model, input_tokens, output_tokens FROM llm_costs.llm_usage) d
WHERE $__timeFilter(time)

[Screenshot: Model breakdown showing which models cost the most]

5. Hourly Token Trend - Time Series

SELECT time, input_tokens, output_tokens
FROM (SELECT DISTINCT time, model, input_tokens, output_tokens FROM llm_costs.llm_usage) d
WHERE $__timeFilter(time)
ORDER BY time ASC

[Screenshot: Complete LLM cost monitoring dashboard]

Alerting on Cost Spikes

Here's the query that would have saved my friend $115,000. Set up a Grafana alert that fires when daily spending exceeds a threshold.

Create the Alert Rule

  1. Go to Alerting → Alert rules → New alert rule
  2. Set the query:
SELECT cost_usd
FROM (SELECT DISTINCT time, line_item, cost_usd FROM llm_costs.llm_costs) d
WHERE time > NOW() - INTERVAL '1 day'
  3. Set the alert condition to fire when Total is above 500
  4. Configure a notification channel (Slack, PagerDuty, email)

[Screenshot: Alert configuration for cost spikes]

The Alert That Saves Money

With this alert in place, here's what would have happened to my friend's startup:

  • Day 1 of viral spike: Daily cost hits $800 by midday (projected $1,600+). Alert fires.
  • Within an hour: Team investigates, finds the follow-up question loop.
  • Same day: Hotfix deployed—rate limit on follow-ups.
  • Total damage: ~$2,000 instead of $115,000.

That's a 98% cost reduction from a single alert.

Taking It Further

Once you have the basic monitoring in place, consider these enhancements:

Per-user cost tracking. Add user_id to your data model. Find out which users (or API consumers) are driving costs. Implement usage limits.

Cost allocation. Tag API calls with team/department. Generate monthly chargebacks. Make cost visible to teams that create it.
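
Both of those boil down to one change: capture attribution where your application makes the call, since the org-level Usage API doesn't know your users or teams. Here's a minimal sketch reusing the collector's write path (the llm_requests table and the user_id/team columns are illustrative, not part of the collector above):

import os
import time
import requests
import msgpack

ARC_URL = os.environ.get("ARC_URL", "http://localhost:8000")
ARC_TOKEN = os.environ.get("ARC_TOKEN")

def log_llm_call(user_id: str, team: str, model: str,
                 input_tokens: int, output_tokens: int):
    """Write one per-request usage record to an llm_requests table in Arc."""
    data = {
        "m": "llm_requests",
        "columns": {
            "time": [int(time.time() * 1000)],
            "user_id": [user_id],
            "team": [team],
            "model": [model],
            "input_tokens": [input_tokens],
            "output_tokens": [output_tokens],
        }
    }
    requests.post(
        f"{ARC_URL}/api/v1/write/msgpack",
        headers={
            "Authorization": f"Bearer {ARC_TOKEN}",
            "Content-Type": "application/msgpack",
            "x-arc-database": "llm_costs",
        },
        data=msgpack.packb(data),
    )

# Call this right after each completion, with the token counts taken from
# the usage object on the SDK's response.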

Prompt optimization alerts. Track average tokens per request by endpoint. Alert when a prompt starts using significantly more tokens—someone might have added "be comprehensive" to a system prompt.
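
The llm_usage table already has what this needs. Average tokens per request, by model and day:

SELECT
    DATE_TRUNC('day', time) as day,
    model,
    SUM(input_tokens + output_tokens) / NULLIF(SUM(num_requests), 0) as avg_tokens_per_request
FROM (SELECT DISTINCT time, model, input_tokens, output_tokens, num_requests FROM llm_costs.llm_usage) d
WHERE time > NOW() - INTERVAL '14 days'
GROUP BY DATE_TRUNC('day', time), model
ORDER BY day DESC;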

Budget forecasting. Use your historical data to project month-end spend. Alert at 50%, 75%, 90% of budget.
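
A naive straight-line projection is a reasonable starting point, assuming spend is roughly uniform across the month:

SELECT
    SUM(cost_usd) as month_to_date,
    ROUND(
        SUM(cost_usd) / EXTRACT(DAY FROM NOW())
        * EXTRACT(DAY FROM DATE_TRUNC('month', NOW()) + INTERVAL '1 month' - INTERVAL '1 day'),
        2
    ) as projected_month_end
FROM (SELECT DISTINCT time, line_item, cost_usd FROM llm_costs.llm_costs) d
WHERE time >= DATE_TRUNC('month', NOW());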

Conclusion

AI costs are the new infrastructure costs. They can scale from $12K to $127K in a month without anyone noticing—unless you're watching.

Arc makes this easy:

  • High-throughput ingestion for real-time cost data
  • SQL queries for instant cost analysis
  • Grafana integration for dashboards and alerts
  • Portable Parquet storage so your cost data isn't locked in

The script we built today takes 30 minutes to set up. It could save you six figures.

Don't wait for the invoice.



Questions? Reach out on Twitter or LinkedIn.

Ready to handle billion-record workloads?

Deploy Arc in minutes. Own your data in Parquet.

Get Started ->