Monitor Your Proxmox Cluster with Telegraf, Arc, and Grafana

Proxmox is fantastic for running VMs and containers. The built-in web UI gives you basic graphs for CPU, memory, and network—enough to know if something is on fire. But when you need to understand resource trends over weeks, compare VM performance, or get alerted before your storage fills up, you need something better.

I run Proxmox for my homelab and for some production workloads. After trying various monitoring setups, I settled on Telegraf + Arc + Grafana. Telegraf collects the metrics, Arc stores them efficiently (even months of data stays fast to query), and Grafana gives you dashboards and alerts.

Let's build it.

The Architecture

┌─────────────────────────────────────────────────────────────┐
│                    Proxmox Cluster                          │
│                                                             │
│  ┌────────────┐  ┌────────────┐  ┌────────────┐             │
│  │   Node 1   │  │   Node 2   │  │   Node 3   │             │
│  │ (pve1)     │  │ (pve2)     │  │ (pve3)     │             │
│  │            │  │            │  │            │             │
│  │ VMs / LXCs │  │ VMs / LXCs │  │ VMs / LXCs │             │
│  └─────┬──────┘  └─────┬──────┘  └─────┬──────┘             │
│        │               │               │                    │
│        └───────────────┼───────────────┘                    │
│                        ▼                                    │
│              ┌─────────────────┐                            │
│              │    Telegraf     │  (on one node or VM)       │
│              │  proxmox input  │                            │
│              └────────┬────────┘                            │
│                       │                                     │
└───────────────────────┼─────────────────────────────────────┘
                        │
                        ▼
               ┌─────────────────┐
               │      Arc        │
               │   Port 8000     │
               └────────┬────────┘
                        │
                        ▼
               ┌─────────────────┐
               │    Grafana      │
               │   Port 3000     │
               └─────────────────┘

Telegraf talks to the Proxmox API to collect metrics from the nodes, VMs, and containers in your cluster. You only need one Telegraf instance: the API is reachable over the network, so a single instance can cover every node, using one proxmox input block per node.

Prerequisites

  • A Proxmox cluster (or single node)
  • A VM or LXC to run the monitoring stack (or run it on a separate machine)
  • A Proxmox API token for Telegraf

Create a Proxmox API Token

First, create a dedicated user and API token for monitoring. In the Proxmox web UI:

  1. Datacenter → Permissions → Users → Add

    • User name: telegraf
    • Realm: pve (or pam if you prefer)
    • No password needed (API token only)
  2. Datacenter → Permissions → API Tokens → Add

    • User: telegraf@pve
    • Token ID: monitoring
    • Privilege Separation: checked
    • Note the token secret—you'll need it
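  3. Datacenter → Permissions → Add → User Permission

    • Path: /
    • User: telegraf@pve
    • Role: PVEAuditor
  4. Datacenter → Permissions → Add → API Token Permission

    • Path: /
    • API Token: telegraf@pve!monitoring
    • Role: PVEAuditor

The PVEAuditor role gives read-only access to all cluster information, which is exactly what we need for monitoring. Because Privilege Separation is checked, the token does not inherit the user's permissions and needs its own entry. The user entry is still required, since a token can never hold more privileges than its user.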

Or via CLI on any Proxmox node:

# Create user
pveum user add telegraf@pve
 
# Create API token (save the output!)
pveum user token add telegraf@pve monitoring --privsep=1
 
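# Grant read access to the entire cluster.
# With --privsep=1 the token needs its own ACL entry in addition to the user's.
pveum aclmod / -user telegraf@pve -role PVEAuditor
pveum aclmod / -token 'telegraf@pve!monitoring' -role PVEAuditor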
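
Before wiring the token into Telegraf, it can help to confirm that it authenticates. The sketch below builds the Authorization header in the PVEAPIToken=USER@REALM!TOKENID=SECRET form used by the Proxmox API and calls the /api2/json/version endpoint. The function names are illustrative, and certificate verification is disabled on the assumption that the node uses a self-signed certificate.

```python
import json
import ssl
import urllib.request


def token_header(token_id: str, secret: str) -> dict:
    """Build the Authorization header for a Proxmox API token."""
    return {"Authorization": f"PVEAPIToken={token_id}={secret}"}


def proxmox_version(base_url: str, token_id: str, secret: str) -> dict:
    """Call /api2/json/version and return its "data" object.

    urllib raises HTTPError (401) if the token is rejected.
    """
    # Assumption: self-signed certificate, so verification is skipped.
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    req = urllib.request.Request(
        f"{base_url}/api2/json/version",
        headers=token_header(token_id, secret),
    )
    with urllib.request.urlopen(req, context=ctx, timeout=10) as resp:
        return json.load(resp)["data"]
```

Calling proxmox_version("https://your-proxmox-node:8006", "telegraf@pve!monitoring", "your-token-secret-here") should return the version details if the token is valid. A successful call only shows that the token authenticates; it does not prove the PVEAuditor role is in place.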

Docker Compose Setup

Create a docker-compose.yml for the monitoring stack:

services:
  arc:
    image: ghcr.io/basekick-labs/arc:latest
    container_name: arc
    restart: unless-stopped
    environment:
      - STORAGE_BACKEND=local
    volumes:
      - arc-data:/app/data
    ports:
      - "8000:8000"
 
  telegraf:
    image: telegraf:latest
    container_name: telegraf
    restart: unless-stopped
    volumes:
      - ./telegraf.conf:/etc/telegraf/telegraf.conf:ro
    environment:
      - PROXMOX_URL=https://your-proxmox-node:8006
      - PROXMOX_TOKEN_ID=telegraf@pve!monitoring
      - PROXMOX_TOKEN_SECRET=your-token-secret-here
      - ARC_TOKEN=your-arc-token-here
    depends_on:
      - arc
 
  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    restart: unless-stopped
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_INSTALL_PLUGINS=grafana-clock-panel
    volumes:
      - grafana-data:/var/lib/grafana
    ports:
      - "3000:3000"
    depends_on:
      - arc
 
volumes:
  arc-data:
  grafana-data:

Telegraf Configuration

Create telegraf.conf:

[agent]
  interval = "30s"
  round_interval = true
  flush_interval = "10s"
  hostname = ""
  omit_hostname = false
 
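# Proxmox VE input plugin
[[inputs.proxmox]]
  ## The plugin expects the API root, including the /api2/json path
  base_url = "${PROXMOX_URL}/api2/json"
  api_token = "${PROXMOX_TOKEN_ID}=${PROXMOX_TOKEN_SECRET}"

  ## Node to query. Defaults to Telegraf's own hostname, which will not
  ## match a Proxmox node when Telegraf runs in a container.
  ## Repeat the [[inputs.proxmox]] block with a different node_name per node.
  node_name = "pve1"

  ## Skip TLS verification (for self-signed certs)
  insecure_skip_verify = true

  ## Response timeout
  response_timeout = "10s"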
 
# Optional: node-level system metrics.
# If Telegraf runs directly ON a Proxmox node (not in a container), you can add:
# [[inputs.cpu]]
# [[inputs.mem]]
# [[inputs.disk]]
# [[inputs.diskio]]
# [[inputs.net]]
 
# Arc output
[[outputs.arc]]
  url = "http://arc:8000/api/v1/write/msgpack"
  api_key = "${ARC_TOKEN}"
  database = "proxmox"
  content_encoding = "gzip"

What Metrics Does the Proxmox Plugin Collect?

The Telegraf Proxmox input plugin collects:

Node metrics:

  • proxmox_node_cpu - CPU usage percentage
  • proxmox_node_memory_used / memory_total - Memory usage
  • proxmox_node_swap_used / swap_total - Swap usage
  • proxmox_node_disk_used / disk_total - Root filesystem usage
  • proxmox_node_uptime - Node uptime in seconds

VM metrics (QEMU):

  • proxmox_qemu_cpu - vCPU usage
  • proxmox_qemu_memory_used / memory_total - Memory allocation
  • proxmox_qemu_disk_read / disk_write - Disk I/O
  • proxmox_qemu_netin / netout - Network traffic
  • proxmox_qemu_uptime - VM uptime
  • proxmox_qemu_status - Running, stopped, etc.

Container metrics (LXC):

  • proxmox_lxc_cpu - CPU usage
  • proxmox_lxc_memory_used / memory_total - Memory usage
  • proxmox_lxc_disk_read / disk_write - Disk I/O
  • proxmox_lxc_netin / netout - Network traffic
  • proxmox_lxc_uptime - Container uptime
  • proxmox_lxc_status - Running, stopped, etc.

Storage metrics:

  • proxmox_storage_used / storage_total - Storage pool usage
  • proxmox_storage_enabled - Storage availability

All metrics include tags for node_name, vm_name, vmid, storage_name, etc., so you can filter and group easily.

Start the Stack

# Start everything
docker compose up -d
 
# Get Arc admin token from logs
docker logs arc | grep "Admin token"
 
# Update ARC_TOKEN in docker-compose.yml with the token,
# then recreate telegraf so it picks up the new environment
# (a plain restart keeps the old values)
docker compose up -d telegraf

Check that metrics are flowing:

# Query Arc to see if data is arriving
# (set ARC_TOKEN in your shell first)
export ARC_TOKEN="your-arc-token-here"
curl -X POST http://localhost:8000/api/v1/query \
  -H "Authorization: Bearer $ARC_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"sql": "SELECT * FROM proxmox.proxmox_node LIMIT 5"}'
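
The same check can be scripted. This is a minimal sketch that mirrors the curl call above using only the Python standard library. The function names are illustrative, and the shape of the JSON that Arc returns is not assumed here, so the response is simply decoded and handed back.

```python
import json
import urllib.request


def build_query_request(arc_url: str, token: str, sql: str) -> urllib.request.Request:
    """Build the POST request for Arc's /api/v1/query endpoint."""
    body = json.dumps({"sql": sql}).encode("utf-8")
    return urllib.request.Request(
        f"{arc_url}/api/v1/query",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


def run_query(arc_url: str, token: str, sql: str):
    """Send the query and return the decoded JSON response."""
    req = build_query_request(arc_url, token, sql)
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)
```

For example, run_query("http://localhost:8000", token, "SELECT * FROM proxmox.proxmox_node LIMIT 5") sends the same query as the curl command.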

Configure Grafana

  1. Open Grafana at http://localhost:3000 (admin/admin)

  2. Add Arc as a data source:

    • Go to Connections → Data sources → Add data source
    • Search for DuckDB
    • URL: http://arc:8000
    • Auth: Add custom header Authorization with value Bearer your-arc-token
  3. Create a dashboard or import one

Sample Queries for Grafana

Here are some useful queries to get you started:

Cluster Overview - CPU Usage per Node

SELECT
  time,
  node_name,
  cpu * 100 as cpu_percent
FROM proxmox.proxmox_node
WHERE time > NOW() - INTERVAL '6 hours'
ORDER BY time

Memory Usage per Node

SELECT
  time,
  node_name,
  (memory_used / memory_total) * 100 as memory_percent
FROM proxmox.proxmox_node
WHERE time > NOW() - INTERVAL '6 hours'
ORDER BY time

Top 10 VMs by CPU Usage (Last Hour)

SELECT
  vm_name,
  node_name,
  AVG(cpu) * 100 as avg_cpu_percent
FROM proxmox.proxmox_qemu
WHERE time > NOW() - INTERVAL '1 hour'
  AND status = 'running'
GROUP BY vm_name, node_name
ORDER BY avg_cpu_percent DESC
LIMIT 10

Storage Pool Usage

SELECT
  time,
  storage_name,
  node_name,
  (used / total) * 100 as usage_percent
FROM proxmox.proxmox_storage
WHERE time > NOW() - INTERVAL '24 hours'
ORDER BY time

VM Status Summary

SELECT
  status,
  COUNT(*) as count
FROM (
  SELECT DISTINCT ON (vmid) vmid, vm_name, status
  FROM proxmox.proxmox_qemu
  WHERE time > NOW() - INTERVAL '5 minutes'
  ORDER BY vmid, time DESC
)
GROUP BY status
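
The query works in two steps: keep only the newest sample for each vmid, then count those samples by status. The sketch below does the same thing on a few in-memory rows to make the logic concrete. The sample data is made up.

```python
from collections import Counter


def status_summary(rows):
    """Count VMs by their most recent status.

    rows: iterable of (time, vmid, status) tuples in any order.
    """
    latest = {}
    for time, vmid, status in rows:
        # Keep only the newest sample seen for each vmid.
        if vmid not in latest or time > latest[vmid][0]:
            latest[vmid] = (time, status)
    return Counter(status for _, status in latest.values())


# Made-up samples: VM 101 was running at t=1 and stopped at t=2.
samples = [
    (1, 100, "running"),
    (1, 101, "running"),
    (2, 101, "stopped"),
    (2, 102, "running"),
]
summary = status_summary(samples)
print(dict(summary))  # {'running': 2, 'stopped': 1}
```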

Network Traffic per VM (Bytes/sec)

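SELECT
  time,
  vm_name,
  -- netin/netout are cumulative byte counters, so divide each delta
  -- by the seconds elapsed since the previous sample
  (netin - LAG(netin) OVER w) / EXTRACT(EPOCH FROM (time - LAG(time) OVER w)) AS netin_bytes_per_sec,
  (netout - LAG(netout) OVER w) / EXTRACT(EPOCH FROM (time - LAG(time) OVER w)) AS netout_bytes_per_sec
FROM proxmox.proxmox_qemu
WHERE time > NOW() - INTERVAL '1 hour'
  AND vm_name = 'your-vm-name'
WINDOW w AS (ORDER BY time)
ORDER BY time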
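
Proxmox reports netin and netout as cumulative byte counters, so a bytes-per-second figure comes from the difference between consecutive samples divided by the time between them. Here is a small sketch of that conversion, assuming timestamps in seconds. The helper name and the sample values are made up for illustration.

```python
def counter_to_rate(samples):
    """Convert cumulative (timestamp_seconds, byte_counter) samples to bytes/sec.

    Returns a list of (timestamp_seconds, rate), one per consecutive pair.
    Intervals where the counter went backwards (for example after a VM
    restart) are skipped rather than reported as negative rates.
    """
    rates = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        elapsed = t1 - t0
        if elapsed <= 0 or c1 < c0:
            continue
        rates.append((t1, (c1 - c0) / elapsed))
    return rates


# Made-up samples at a 30-second interval, with a counter reset at t=90.
netin = [(0, 1_000), (30, 4_000), (60, 10_000), (90, 500), (120, 3_500)]
print(counter_to_rate(netin))  # [(30, 100.0), (60, 200.0), (120, 100.0)]
```

Skipping the reset interval is a design choice: it leaves a gap in the graph instead of a large negative spike.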

Building a Dashboard

Here's a suggested dashboard layout:

Row 1: Cluster Overview

  • Stat panels: Total nodes, Total VMs running, Total LXCs running, Total storage used
  • Gauge: Average cluster CPU usage

Row 2: Node Health

  • Time series: CPU usage per node (stacked)
  • Time series: Memory usage per node (stacked)
  • Bar gauge: Storage pool usage

Row 3: VM Performance

  • Table: Top 10 VMs by CPU (last hour)
  • Table: Top 10 VMs by memory (last hour)
  • Pie chart: VMs per node distribution

Row 4: Resource Trends

  • Time series: VM count over time (running vs stopped)
  • Time series: Total network traffic (cluster-wide)
  • Time series: Disk I/O per node

Alerting

Set up alerts in Grafana for:

  • Storage > 85% - Time to clean up or expand
  • Node CPU > 90% for 5 minutes - Possible overload
  • Node memory > 95% - Risk of OOM
  • VM stopped unexpectedly - Compare current vs expected state

Example alert query for storage:

SELECT
  storage_name,
  node_name,
  MAX(used / total) * 100 as usage_percent
FROM proxmox.proxmox_storage
WHERE time > NOW() - INTERVAL '5 minutes'
GROUP BY storage_name, node_name
HAVING MAX(used / total) > 0.85

Advanced: Adding ZFS Metrics

If you're using ZFS on your Proxmox nodes (which you should be), you can add more detailed storage metrics. Run Telegraf directly on each Proxmox node with the zfs input plugin:

[[inputs.zfs]]
  poolMetrics = true
  datasetMetrics = true

This gives you:

  • Pool health status
  • Fragmentation percentage
  • Dedup ratio
  • ARC hit/miss rates
  • Dataset-level usage

Advanced: Backup Monitoring

Monitor your Proxmox backups by parsing the backup logs. Add this to Telegraf:

[[inputs.exec]]
  commands = ["/usr/local/bin/backup-status.sh"]
  timeout = "30s"
  data_format = "influx"
  interval = "1h"

The script parses the logs in /var/log/vzdump/ to extract backup success/failure and duration, and prints the results in InfluxDB line protocol so Telegraf can ingest them.
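
As a rough sketch of what such a script could do, the Python below turns vzdump log text into line protocol. It assumes log lines of the form "INFO: Finished Backup of VM 100 (00:01:23)" and "ERROR: Backup of VM 101 failed - ...", so check these patterns against your own logs. The measurement name proxmox_backup and its fields are hypothetical. To use a Python script, the commands entry above would need to point at it, and Telegraf would need to run where it can read the logs.

```python
import re

# Assumed vzdump log line formats; verify against your own log files.
FINISHED = re.compile(r"INFO: Finished Backup of VM (\d+) \((\d+):(\d+):(\d+)\)")
FAILED = re.compile(r"ERROR: Backup of VM (\d+) failed")


def to_line_protocol(log_text: str) -> list:
    """Turn vzdump log text into InfluxDB line protocol lines.

    The measurement name "proxmox_backup" and its fields are hypothetical.
    No timestamp is written, so Telegraf stamps each point on collection.
    """
    lines = []
    for line in log_text.splitlines():
        ok = FINISHED.search(line)
        if ok:
            vmid, hours, minutes, seconds = ok.groups()
            duration = int(hours) * 3600 + int(minutes) * 60 + int(seconds)
            lines.append(
                f"proxmox_backup,vmid={vmid} success=1i,duration_s={duration}i"
            )
            continue
        bad = FAILED.search(line)
        if bad:
            lines.append(f"proxmox_backup,vmid={bad.group(1)} success=0i")
    return lines
```

A real script would read the files under /var/log/vzdump/ and print each returned line to standard output.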

Why Arc for Proxmox Monitoring?

A few reasons I prefer Arc over other time-series databases for this:

  1. Compression - Proxmox generates a lot of metrics. Arc's Parquet storage compresses efficiently, so months of data stays manageable.

  2. Fast queries - When you need to analyze trends over weeks or months, Arc's DuckDB engine handles it without breaking a sweat.

  3. SQL - No proprietary query language. Standard SQL means your Grafana dashboards are easy to build and maintain.

  4. Portable data - Your metrics are stored in Parquet files. If you ever need to analyze them with Pandas, DuckDB CLI, or any other tool, you can.

Questions? Find me on Discord or Twitter.