
Send Home Assistant Data to Arc, Queryable Forever

Tags: Arc, Home Assistant, IoT, smart home, Line Protocol, InfluxDB, Telegraf, Grafana, homelab, tutorial, sensors

Home Assistant's built-in Recorder is fine for a couple of weeks of state history. Past that, it's a SQLite file that grows until something else breaks first. You can crank up retention, you can swap in MariaDB, you can tune the purge schedule, and you'll still end up with a database that wasn't designed for "five years of every motion sensor in the house" workloads.

We can fix that with two existing things we don't have to build: Home Assistant's built-in InfluxDB integration, and Arc.

Arc speaks the InfluxDB Line Protocol on the write path. Home Assistant has shipped an "InfluxDB" integration since approximately the dawn of time. Point one at the other, and every state change your house produces lands in a columnar Parquet archive you actually own. No add-on. No glue script. No Python.

Let's wire it up.

Why Arc for This

  • Line Protocol compatibility: HA's built-in integration writes Line Protocol over HTTP. Arc accepts Line Protocol over HTTP. They speak the same language already, we're just convincing them to dial each other.
  • Parquet storage you keep: Every flush from Arc lands in time-partitioned Parquet on S3, MinIO, Azure Blob, or local disk. No proprietary format. DuckDB, Polars, Spark, BigQuery, Snowflake, anything that reads Parquet can read your home's history.
  • SQL on top: Once it's in Arc, "what's my average bedroom temperature at 3am over the last six months" is a query, not a project.

What You'll Need

  • A machine with Docker installed (any small box on your LAN works; a Pi 4 or 5 is plenty).
  • About fifteen minutes.

If you've never used Arc before, we get it running in one command in the next section. If you've never used Home Assistant either, we cover that next. If both are already up, skip ahead to How It Fits Together.

Get Arc Running (If You Don't Have It Yet)

Arc is a high-performance columnar analytical database designed for time-series workloads. It speaks the same write protocols as InfluxDB (Line Protocol, MessagePack), stores data as plain Apache Parquet on object storage or local disk, and exposes a SQL query API over the result. It ships as a single Go binary, packaged as a Docker image.

For a homelab setup, local-disk storage is fine. One container, one volume for the Parquet files:

docker run -d -p 8000:8000 \
  --name arc \
  --restart unless-stopped \
  -e STORAGE_BACKEND=local \
  -v arc-data:/app/data \
  ghcr.io/basekick-labs/arc:latest

Confirm it's up:

curl http://localhost:8000/health
# {"status":"ok","time":"2026-05-12T15:31:42Z","uptime":"6.794009379s","uptime_sec":6.794009379}

On first boot Arc generates an initial admin token and writes it to the container logs. Grab it:

docker logs arc 2>&1 | grep "Admin API token"
# Admin API token: ark_abc123...xyz

Save that token. We'll use it in Step 1 to create the database and mint a write-scoped token for Home Assistant. The admin token itself stays on the Arc host; we never hand it to HA.

If you want production-grade storage (S3, MinIO, Azure Blob, clustering, RBAC), the Arc documentation covers it. For a homelab, the single-container local-disk setup above is enough to run for years.

Get Home Assistant Running (If You Don't Have It Yet)

Home Assistant is the open source home automation hub that everyone reading this has probably already heard of. If you have HA up and accessible at http://your-ha-host:8123, skip to Step 1.

If you don't, the official recommendation is Docker Compose. Save this as docker-compose.yml somewhere on your Docker host:

services:
  homeassistant:
    container_name: homeassistant
    image: ghcr.io/home-assistant/home-assistant:stable
    volumes:
      - ./ha-config:/config
      - /etc/localtime:/etc/localtime:ro
      - /run/dbus:/run/dbus:ro
    restart: unless-stopped
    privileged: true
    network_mode: host
    environment:
      TZ: America/Costa_Rica

Change TZ to your timezone. Then:

docker compose up -d

HA needs network_mode: host for mDNS device discovery and Bluetooth to work; the privileged flag is what covers Z-Wave/Zigbee USB sticks. On a Linux Docker host, this is the right setup.

On Docker Desktop for macOS or Windows, network_mode: host does not work. Docker Desktop runs containers inside a Linux VM, so "host" means the VM, not your machine. HA boots, but you can't reach it. Use this compose instead for Mac and Windows testing:

services:
  homeassistant:
    container_name: homeassistant
    image: ghcr.io/home-assistant/home-assistant:stable
    volumes:
      - ./ha-config:/config
      - /etc/localtime:/etc/localtime:ro
    restart: unless-stopped
    privileged: true
    ports:
      - "8123:8123"
    environment:
      TZ: America/Costa_Rica

Note: this variant drops /run/dbus (doesn't exist on macOS) and swaps host networking for an explicit port mapping. You lose mDNS and USB-stick passthrough, which is fine for a tutorial run-through and not fine for a real production HA install.

Open http://localhost:8123 (Mac/Windows) or http://your-host-ip:8123 (Linux) in a browser. HA will walk you through the onboarding wizard: name, location, units, account. Five minutes, no surprises. When you land on the default dashboard, HA is up.

The ./ha-config directory now holds your configuration.yaml and secrets.yaml. We'll edit both in Step 2.
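
A quick ls confirms the files are in place (onboarding creates a few more alongside them):

ls ./ha-config
# automations.yaml  configuration.yaml  scenes.yaml  scripts.yaml  secrets.yaml  ...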

The official HA install docs have more options (HA OS, Supervised, Core) if Docker Compose isn't your speed: home-assistant.io/installation.

How It Fits Together

+------------------+      Line Protocol       +-----------+      Parquet
| Home Assistant   |  --------------------->  |    Arc    |  ----------->  S3 / MinIO / disk
|  (InfluxDB       |   POST /api/v1/write     |           |
|   integration)   |   Bearer <token>         +-----------+
+------------------+

HA calls it the "InfluxDB integration." We're going to use it to talk to a database that isn't InfluxDB. Welcome to time-series, where every vendor implements the same write API and quietly pretends nobody else is doing it.

Step 1: Create an Arc Database and Token

Strictly speaking, you don't have to create the database up front. Arc auto-creates databases the first time data is written to them, so if you skip this step and start HA, the home_assistant database will pop into existence the moment the first state change lands. We're doing it explicitly here so the token we create in a second can be scoped to exactly one database, which is the polite thing to do.

Run these on the Arc host (or wherever you can reach it):

export ARC_TOKEN="ark_paste_your_admin_token_here"
 
# Create the database (optional, see note above)
curl -X POST http://localhost:8000/api/v1/databases \
  -H "Authorization: Bearer $ARC_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"name": "home_assistant"}'
# {"name":"home_assistant","measurement_count":0,"created_at":"2026-05-12T15:41:58Z"}
 
# Create a write-scoped token for HA to use
curl -X POST http://localhost:8000/api/v1/auth/tokens \
  -H "Authorization: Bearer $ARC_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"name": "home-assistant", "permissions": ["write"], "databases": ["home_assistant"]}'
# {"message":"Token created successfully. Store this token securely - it cannot be retrieved again.","success":true,"token":"M1AoPn5nUwMOEqhS27UeaDUWl4UqbVgRIqtyUlk1Mag="}

Copy the token that comes back. We'll drop it into HA's secrets file in a second. Don't email it to yourself.
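
Optionally, smoke-test the token before HA gets involved by hand-writing a single Line Protocol point. One hedge: the x-arc-database header below is borrowed from the query endpoint we use later in this post; if your Arc version selects the write target differently (a URL parameter, say), adjust to match.

# Optional smoke test: one hand-written point, using the new write-scoped token.
# Assumption: /api/v1/write honors the same x-arc-database header as the query
# endpoint. The smoke_test measurement name is arbitrary.
curl -X POST http://localhost:8000/api/v1/write \
  -H "Authorization: Bearer <paste-the-write-scoped-token-here>" \
  -H "x-arc-database: home_assistant" \
  --data-binary 'smoke_test,source=tutorial value=1'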

Step 2: Point Home Assistant at Arc

Open Home Assistant's configuration.yaml. If you're using the File Editor add-on, it's right there. If you're on Container with no add-ons, it's in your HA config volume.

Add Arc's token to secrets.yaml (note the .yaml extension; HA does not pick up .yml):

# secrets.yaml
arc_token: "your-write-scoped-arc-token-here"

Then add the InfluxDB integration block to configuration.yaml. We're using YAML to bootstrap connection settings; HA will import them into a UI config entry on first run and then ask us to remove them. We'll do that in a minute. For now, paste the full block:

# configuration.yaml
influxdb:
  api_version: 2
  ssl: false
  host: arc.example.lan
  port: 8000
  token: !secret arc_token
  organization: ""
  bucket: home_assistant
  measurement_attr: domain__device_class
  default_measurement: state
  max_retries: 3
  include:
    domains:
      - sensor
      - binary_sensor
      - climate
      - light
      - switch
      - media_player
      - person
      - sun
      - weather
  exclude:
    entity_globs:
      - sensor.*_uptime
      - sensor.*_last_boot

A few notes on those keys:

  • host is the address HA uses to reach Arc. The right value depends on your setup:
    • HA and Arc on different machines: the LAN hostname or IP of the Arc host (e.g. arc.example.lan or 192.168.1.50).
    • HA and Arc both in containers on a Linux Docker host: put both services in the same Docker network and use the service name (arc). The simplest way to do that is to add Arc to your HA docker-compose.yml so they share the same auto-created network.
    • HA and Arc both in containers on Docker Desktop for macOS or Windows: use host.docker.internal. Each container reaches the host's loopback that way, and your -p 8000:8000 port mapping on Arc handles the rest.
    • localhost works only when HA can see the host's loopback: true for the Linux host-networking setup above with Arc publishing port 8000 on the same machine, false in nearly every other container layout, where it points at HA's own container. Don't reach for it first.
  • api_version: 2 is what tells HA to authenticate with a bearer token instead of basic auth. Arc's /api/v1/write endpoint accepts the same Bearer header, so HA's v2 path lands in the right place.
  • organization is a v2-required field that Arc ignores. Leave it empty or put any string; it does not matter.
  • bucket maps to Arc's database name.
  • measurement_attr: domain__device_class keeps the schema sensible. Without it, HA defaults to writing every entity into a single state measurement and you'll spend an evening reverse-engineering your own data.
  • include/exclude is where you decide whether to fire-hose Arc with every entity in your house or curate it. Start permissive, tighten later.
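
One more thing before the restart in Step 3: HA ships a built-in config checker, and running it now catches typos in the block above without a restart-and-pray loop.

# Validate configuration.yaml without restarting; reports YAML and schema errors
docker exec homeassistant python -m homeassistant --script check_config --config /config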

After First Boot, Trim the YAML

Home Assistant is in the middle of moving the InfluxDB integration off YAML and into the UI. Connection settings now live in a UI config entry, and HA imports them from configuration.yaml on first run. After the import, HA shows a warning under Settings → Repairs that looks like:

The InfluxDB YAML configuration is being removed. Your existing YAML connection configuration has been imported into the UI automatically. Remove the influxdb connection and authentication keys from your configuration.yaml file and restart Home Assistant to fix this issue.

This is expected and the message itself tells you exactly what to do. Trim configuration.yaml down to the keys HA still reads from YAML (measurement_attr, default_measurement, max_retries, include, exclude, tags, and a couple of others):

# configuration.yaml (after the cleanup)
influxdb:
  measurement_attr: domain__device_class
  default_measurement: state
  max_retries: 3
  include:
    domains:
      - sensor
      - binary_sensor
      - climate
      - light
      - switch
      - media_player
      - person
      - sun
      - weather
  exclude:
    entity_globs:
      - sensor.*_uptime
      - sensor.*_last_boot

Restart HA. The Repairs warning clears. From here on, the UI is the source of truth for connection settings; YAML is the source of truth for shape and filters.

If you're starting fresh today on a brand new HA install, you can skip the YAML connection block entirely and configure the integration from Settings → Devices & Services → Add Integration → InfluxDB instead. The YAML path is just easier to copy-paste from a tutorial. Either route ends at the same place.

The hard deprecation is scheduled for HA 2026.9.0; the YAML connection keys stop working entirely after that. Anything in the trimmed block above keeps working.

Step 3: Restart Home Assistant and Watch the Data Flow

Restart HA:

docker compose restart homeassistant

Then open http://localhost:8123 and complete the onboarding wizard (name, location, units, account) if you haven't already. The InfluxDB integration doesn't activate until HA is past onboarding, so a fresh install will show no data flowing until you finish that one-minute setup.
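
While HA settles, tail the logs and watch for the integration's first flush, or its first complaint:

# Follow HA's log and surface anything the InfluxDB integration says
docker logs -f homeassistant 2>&1 | grep -i influx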

Three Places to Check That It's Working

1. HA UI → Settings → Devices & Services → InfluxDB

The integration should appear with a title like home_assistant (http://your-arc-host:8000) and no error indicator. That confirms HA imported the YAML config into a UI entry and the integration loaded.

2. HA UI → Settings → Repairs

After you trimmed the YAML connection keys, this panel should be clean. If you still see "The InfluxDB YAML configuration is being removed," your configuration.yaml still has connection keys in it. Remove them and restart.

3. Query Arc directly

This is the real proof. List your databases:

curl -s http://localhost:8000/api/v1/databases \
  -H "Authorization: Bearer $ARC_TOKEN" | python3 -m json.tool

You should see home_assistant with a non-zero measurement_count. Then list the tables and run a query against the weather measurement (HA's default install ships with a weather entity, so this is the easiest one to query right after onboarding):

# List tables in the home_assistant database
curl -s -X POST http://localhost:8000/api/v1/query \
  -H "Authorization: Bearer $ARC_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"sql": "SHOW TABLES FROM home_assistant"}' | python3 -m json.tool
 
# Look at the most recent weather rows
curl -s -X POST http://localhost:8000/api/v1/query \
  -H "Authorization: Bearer $ARC_TOKEN" \
  -H "x-arc-database: home_assistant" \
  -H "Content-Type: application/json" \
  -d '{"sql": "SELECT entity_id, state, time FROM weather ORDER BY time DESC LIMIT 5"}' \
  | python3 -m json.tool

You should get a few rows back with entity_id: forecast_home and the current conditions. HA flushes batches every few seconds, so even on a brand-new install with no devices added, weather and sun.sun state changes start landing within ~30 seconds of completing onboarding.

If after all that the Arc database is still empty: check the HA logs (docker logs --tail 100 homeassistant | grep -i influxdb) for connection errors. The two common ones are DNS (the host value isn't reachable from inside HA's container) and auth (token wrong, expired, or missing write permission).
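
For the DNS case, the decisive test runs from inside HA's own network namespace. The official HA container image ships curl (if yours somehow doesn't, busybox wget works the same way); substitute your actual host value:

# Hit Arc's health endpoint from inside the HA container to rule out DNS issues
docker exec homeassistant curl -s http://arc.example.lan:8000/health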

What Home Assistant Sends Arc

The Line Protocol HA emits looks roughly like this for a temperature sensor:

sensor.temperature,entity_id=bedroom_temperature,friendly_name=Bedroom\ Temperature,domain=sensor value=21.4 1715446800000000000

In Arc, after schema inference, that becomes a row in the sensor.temperature table (or whatever your measurement_attr setting produces) with:

  • time (the timestamp HA sent)
  • entity_id (column from the tag)
  • friendly_name (column from the tag)
  • domain (column from the tag)
  • value (column from the field, type-inferred to DOUBLE)

State changes for non-numeric entities (a light turning on, a person arriving home) come through with the state as a string field. HA's integration also writes a value field where it can coerce the state to a number; the rest is in state or named attribute fields.

The exact column set depends on which entities you let through your include/exclude. Run DESCRIBE home_assistant.sensor.temperature (or whatever measurement you're interested in) to see what's actually there.
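
Through the query endpoint, that DESCRIBE is the same curl pattern as before:

# Inspect the inferred schema of a measurement
curl -s -X POST http://localhost:8000/api/v1/query \
  -H "Authorization: Bearer $ARC_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"sql": "DESCRIBE home_assistant.sensor.temperature"}' | python3 -m json.tool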

Queries Worth Running

A few SQL queries that justify having this data in Arc in the first place.

Average temperature per room over the last 24 hours:

SELECT
  entity_id,
  AVG(value) AS avg_temp_c,
  MIN(value) AS min_temp_c,
  MAX(value) AS max_temp_c
FROM home_assistant.sensor.temperature
WHERE time > NOW() - INTERVAL '24 hours'
GROUP BY entity_id
ORDER BY avg_temp_c DESC;

Top 10 noisiest entities (the ones flooding your Recorder):

SELECT
  entity_id,
  COUNT(*) AS state_changes
FROM home_assistant.state
WHERE time > NOW() - INTERVAL '7 days'
GROUP BY entity_id
ORDER BY state_changes DESC
LIMIT 10;

The output of that query usually contains at least one sensor you didn't know was firing every two seconds. Now you know. Add it to your exclude and move on.

Kitchen light on-time per day for the last week (assumes you have a light.kitchen entity):

WITH transitions AS (
  SELECT
    time,
    state,
    LAG(state) OVER (ORDER BY time) AS prev_state,
    LAG(time)  OVER (ORDER BY time) AS prev_time
  FROM home_assistant.light
  WHERE entity_id = 'light.kitchen'
    AND time > NOW() - INTERVAL '7 days'
)
SELECT
  DATE_TRUNC('day', prev_time) AS day,
  SUM(EXTRACT(EPOCH FROM (time - prev_time))) / 3600.0 AS on_hours
FROM transitions
WHERE prev_state = 'on'
GROUP BY day
ORDER BY day;

Motion events by hour of day across a month (find your real "active hours"):

SELECT
  EXTRACT(HOUR FROM time) AS hour_of_day,
  COUNT(*) AS motion_events
FROM home_assistant.binary_sensor.motion
WHERE state = 'on'
  AND time > NOW() - INTERVAL '30 days'
GROUP BY hour_of_day
ORDER BY hour_of_day;

None of these queries are possible against HA's default Recorder in any reasonable way. All of them are cheap against Parquet.

Optional: Grafana for the Pretty Charts

SQL in a terminal is great until you want to share it with someone who doesn't read SQL. The Grafana + Arc datasource takes about five minutes to set up. Point a Grafana panel at the home_assistant database, write the same SQL queries, get charts.

Common dashboard panels worth building first: room temperatures over the last 24 hours, motion events per hour over the last week, top 10 most-active entities, and a leaderboard of which lights are on longest each day.

What You've Got

Every state change in your house, now in columnar Parquet you own. Query it from Arc, query it from DuckDB on your laptop, ship it to a notebook, or ignore it for two years and the data will still be there.
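
The "DuckDB on your laptop" part is easy to make concrete. A minimal sketch, assuming the single-container local-disk setup from earlier (Parquet lives under /app/data inside the container; the partition layout within it is Arc's, so a recursive glob picks everything up):

# Copy the Parquet files out of Arc's data volume, then query them
# directly with DuckDB: no Arc, no HA, just files
docker cp arc:/app/data ./arc-data
duckdb -c "SELECT COUNT(*) FROM read_parquet('arc-data/**/*.parquet')"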

When HA ships its next Recorder rewrite, when a database update tanks your SQLite file, when you decide to migrate from one HA install to another, none of it touches your archive. Arc has the data. HA can come and go.

If you do something interesting with this (year-long heatmaps of your house, climate-correlated sleep tracking, "is my fridge actually cycling correctly" forensics), tell us about it in Discord. We collect interesting use cases.

