Monitor Your Proxmox Cluster with Telegraf, Arc, and Grafana

Proxmox is fantastic for running VMs and containers. The built-in web UI gives you basic graphs for CPU, memory, and network—enough to know if something is on fire. But when you need to understand resource trends over weeks, compare VM performance, or get alerted before your storage fills up, you need something better.
I run Proxmox for my homelab and for some production workloads. After trying various monitoring setups, I settled on Telegraf + Arc + Grafana. Telegraf collects the metrics, Arc stores them efficiently (even months of data stays fast to query), and Grafana gives you dashboards and alerts.
Let's build it.
The Architecture
┌─────────────────────────────────────────────────────────┐
│                     Proxmox Cluster                     │
│                                                         │
│  ┌────────────┐  ┌────────────┐  ┌────────────┐         │
│  │   Node 1   │  │   Node 2   │  │   Node 3   │         │
│  │   (pve1)   │  │   (pve2)   │  │   (pve3)   │         │
│  │            │  │            │  │            │         │
│  │ VMs / LXCs │  │ VMs / LXCs │  │ VMs / LXCs │         │
│  └─────┬──────┘  └─────┬──────┘  └─────┬──────┘         │
│        │               │               │                │
│        └───────────────┼───────────────┘                │
│                        ▼                                │
│               ┌─────────────────┐                       │
│               │    Telegraf     │ (on one node or VM)   │
│               │  proxmox input  │                       │
│               └────────┬────────┘                       │
│                        │                                │
└────────────────────────┼────────────────────────────────┘
                         │
                         ▼
                ┌─────────────────┐
                │       Arc       │
                │    Port 8000    │
                └────────┬────────┘
                         │
                         ▼
                ┌─────────────────┐
                │     Grafana     │
                │    Port 3000    │
                └─────────────────┘
Telegraf talks to the Proxmox API to collect metrics from all nodes, VMs, and containers in your cluster. You only need one Telegraf instance—it can monitor the entire cluster via the API.
Prerequisites
- A Proxmox cluster (or single node)
- A VM or LXC to run the monitoring stack (or run it on a separate machine)
- A Proxmox API token for Telegraf
Create a Proxmox API Token
First, create a dedicated user and API token for monitoring. In the Proxmox web UI:
1. Datacenter → Permissions → Users → Add
   - User name: telegraf
   - Realm: pve (or pam if you prefer)
   - No password needed (API token only)
2. Datacenter → Permissions → API Tokens → Add
   - User: telegraf@pve
   - Token ID: monitoring
   - Privilege Separation: checked
   - Note the token secret—you'll need it
3. Datacenter → Permissions → Add → User Permission
   - Path: /
   - User: telegraf@pve
   - Role: PVEAuditor
The PVEAuditor role gives read-only access to all cluster information—exactly what we need for monitoring.
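Before handing the token to Telegraf, it's worth sanity-checking it by hand. A minimal Python sketch — the host and secret are placeholders to substitute, and the `PVEAPIToken` header format is the standard Proxmox API token scheme:

```python
# Sanity-check a Proxmox API token by listing cluster nodes.
import json
import ssl
import urllib.request

def pve_auth_header(user: str, token_id: str, secret: str) -> dict:
    """Build the Proxmox token header: PVEAPIToken=USER@REALM!TOKENID=SECRET."""
    return {"Authorization": f"PVEAPIToken={user}!{token_id}={secret}"}

def list_nodes(base_url: str, user: str, token_id: str, secret: str) -> list:
    """GET /api2/json/nodes with token auth."""
    req = urllib.request.Request(
        f"{base_url}/api2/json/nodes",
        headers=pve_auth_header(user, token_id, secret),
    )
    # Self-signed certs are the norm on Proxmox; skip verification,
    # same as insecure_skip_verify = true in telegraf.conf.
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    with urllib.request.urlopen(req, context=ctx, timeout=10) as resp:
        return json.load(resp)["data"]

# Usage, with your real host and token secret:
#   for node in list_nodes("https://your-proxmox-node:8006",
#                          "telegraf@pve", "monitoring", "your-token-secret"):
#       print(node["node"], node.get("status"))
```

If this returns your node list, the token and the PVEAuditor grant are working.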
Or via CLI on any Proxmox node:
# Create user
pveum user add telegraf@pve
# Create API token (save the output!)
pveum user token add telegraf@pve monitoring --privsep=1
# Grant read access to the entire cluster
pveum aclmod / -user telegraf@pve -role PVEAuditor

Docker Compose Setup
Create a docker-compose.yml for the monitoring stack:
services:
  arc:
    image: ghcr.io/basekick-labs/arc:latest
    container_name: arc
    restart: unless-stopped
    environment:
      - STORAGE_BACKEND=local
    volumes:
      - arc-data:/app/data
    ports:
      - "8000:8000"

  telegraf:
    image: telegraf:latest
    container_name: telegraf
    restart: unless-stopped
    volumes:
      - ./telegraf.conf:/etc/telegraf/telegraf.conf:ro
    environment:
      - PROXMOX_URL=https://your-proxmox-node:8006
      - PROXMOX_TOKEN_ID=telegraf@pve!monitoring
      - PROXMOX_TOKEN_SECRET=your-token-secret-here
      - ARC_TOKEN=your-arc-token-here
    depends_on:
      - arc

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    restart: unless-stopped
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_INSTALL_PLUGINS=grafana-clock-panel
    volumes:
      - grafana-data:/var/lib/grafana
    ports:
      - "3000:3000"
    depends_on:
      - arc

volumes:
  arc-data:
  grafana-data:

Telegraf Configuration
Create telegraf.conf:
[agent]
interval = "30s"
round_interval = true
flush_interval = "10s"
hostname = ""
omit_hostname = false
# Proxmox VE input plugin
[[inputs.proxmox]]
base_url = "${PROXMOX_URL}"
api_token = "${PROXMOX_TOKEN_ID}=${PROXMOX_TOKEN_SECRET}"
## Skip TLS verification (for self-signed certs)
insecure_skip_verify = true
## Response timeout
response_timeout = "10s"
# Optional: collect node-level system metrics via SNMP or node_exporter
# If Telegraf runs ON a Proxmox node, you can add:
# [[inputs.cpu]]
# [[inputs.mem]]
# [[inputs.disk]]
# [[inputs.diskio]]
# [[inputs.net]]
# Arc output
[[outputs.arc]]
url = "http://arc:8000/api/v1/write/msgpack"
api_key = "${ARC_TOKEN}"
database = "proxmox"
content_encoding = "gzip"

What Metrics Does the Proxmox Plugin Collect?
The Telegraf Proxmox input plugin collects:
Node metrics:
- proxmox_node_cpu - CPU usage percentage
- proxmox_node_memory_used / memory_total - Memory usage
- proxmox_node_swap_used / swap_total - Swap usage
- proxmox_node_disk_used / disk_total - Root filesystem usage
- proxmox_node_uptime - Node uptime in seconds
VM metrics (QEMU):
- proxmox_qemu_cpu - vCPU usage
- proxmox_qemu_memory_used / memory_total - Memory allocation
- proxmox_qemu_disk_read / disk_write - Disk I/O
- proxmox_qemu_netin / netout - Network traffic
- proxmox_qemu_uptime - VM uptime
- proxmox_qemu_status - Running, stopped, etc.
Container metrics (LXC):
- proxmox_lxc_cpu - CPU usage
- proxmox_lxc_memory_used / memory_total - Memory usage
- proxmox_lxc_disk_read / disk_write - Disk I/O
- proxmox_lxc_netin / netout - Network traffic
- proxmox_lxc_uptime - Container uptime
- proxmox_lxc_status - Running, stopped, etc.
Storage metrics:
- proxmox_storage_used / storage_total - Storage pool usage
- proxmox_storage_enabled - Storage availability
All metrics include tags for node_name, vm_name, vmid, storage_name, etc., so you can filter and group easily.
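One caveat worth knowing: the traffic and I/O fields (netin, netout, disk_read, disk_write) come from the Proxmox API as cumulative counters — running totals since the VM started — not rates. To graph bytes/sec, take a non-negative derivative; Grafana can do this for you, and the logic is simple enough to sketch:

```python
# Turn cumulative counter samples into per-second rates.
# Each sample is (unix_timestamp, counter_value); a counter reset
# (e.g. a VM restart) shows up as a drop, which we clamp to zero.
def non_negative_derivative(samples):
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        dt = t1 - t0
        if dt <= 0:
            continue  # skip duplicate or out-of-order timestamps
        rates.append((t1, max(v1 - v0, 0) / dt))
    return rates

# 30s scrape interval, counter in bytes; reset at t=90:
samples = [(0, 1_000), (30, 31_000), (60, 61_000), (90, 500)]
print(non_negative_derivative(samples))
# -> [(30, 1000.0), (60, 1000.0), (90, 0.0)]
```

The same clamping is what Grafana's "non-negative derivative" transform does, so you can apply it either in the dashboard or in a preprocessing step.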
Start the Stack
# Start everything
docker compose up -d
# Get Arc admin token from logs
docker logs arc | grep "Admin token"
# Update ARC_TOKEN in docker-compose.yml with the token
# Then restart telegraf
docker compose restart telegraf

Check that metrics are flowing:
# Query Arc to see if data is arriving
curl -X POST http://localhost:8000/api/v1/query \
-H "Authorization: Bearer $ARC_TOKEN" \
-H "Content-Type: application/json" \
-d '{"sql": "SELECT * FROM proxmox.proxmox_node LIMIT 5"}'

Configure Grafana
1. Open Grafana at http://localhost:3000 (admin/admin)
2. Add Arc as a data source:
   - Go to Connections → Data sources → Add data source
   - Search for DuckDB
   - URL: http://arc:8000
   - Auth: Add custom header Authorization with value Bearer your-arc-token
3. Create a dashboard or import one
Sample Queries for Grafana
Here are some useful queries to get you started:
Cluster Overview - CPU Usage per Node
SELECT
time,
node_name,
cpu * 100 as cpu_percent
FROM proxmox.proxmox_node
WHERE time > NOW() - INTERVAL '6 hours'
ORDER BY time

Memory Usage per Node
SELECT
time,
node_name,
(memory_used / memory_total) * 100 as memory_percent
FROM proxmox.proxmox_node
WHERE time > NOW() - INTERVAL '6 hours'
ORDER BY time

Top 10 VMs by CPU Usage (Last Hour)
SELECT
vm_name,
node_name,
AVG(cpu) * 100 as avg_cpu_percent
FROM proxmox.proxmox_qemu
WHERE time > NOW() - INTERVAL '1 hour'
AND status = 'running'
GROUP BY vm_name, node_name
ORDER BY avg_cpu_percent DESC
LIMIT 10

Storage Pool Usage
SELECT
time,
storage_name,
node_name,
(used / total) * 100 as usage_percent
FROM proxmox.proxmox_storage
WHERE time > NOW() - INTERVAL '24 hours'
ORDER BY time

VM Status Summary
SELECT
status,
COUNT(*) as count
FROM (
SELECT DISTINCT ON (vmid) vmid, vm_name, status
FROM proxmox.proxmox_qemu
WHERE time > NOW() - INTERVAL '5 minutes'
ORDER BY vmid, time DESC
)
GROUP BY status

Network Traffic per VM (Bytes/sec)
SELECT
time,
vm_name,
netin,
netout
FROM proxmox.proxmox_qemu
WHERE time > NOW() - INTERVAL '1 hour'
AND vm_name = 'your-vm-name'
ORDER BY time

Building a Dashboard
Here's a suggested dashboard layout:
Row 1: Cluster Overview
- Stat panels: Total nodes, Total VMs running, Total LXCs running, Total storage used
- Gauge: Average cluster CPU usage
Row 2: Node Health
- Time series: CPU usage per node (stacked)
- Time series: Memory usage per node (stacked)
- Bar gauge: Storage pool usage
Row 3: VM Performance
- Table: Top 10 VMs by CPU (last hour)
- Table: Top 10 VMs by memory (last hour)
- Pie chart: VMs per node distribution
Row 4: Resource Trends
- Time series: VM count over time (running vs stopped)
- Time series: Total network traffic (cluster-wide)
- Time series: Disk I/O per node
Alerting
Set up alerts in Grafana for:
- Storage > 85% - Time to clean up or expand
- Node CPU > 90% for 5 minutes - Possible overload
- Node memory > 95% - Risk of OOM
- VM stopped unexpectedly - Compare current vs expected state
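If you'd rather drive notifications from a cron job than from Grafana's alerting, the storage check is easy to replicate in code. A sketch — the row shape mirrors the proxmox_storage columns, and the 0.85 cutoff matches the threshold above; everything else (function name, message format) is hypothetical:

```python
# Evaluate the storage threshold outside Grafana, e.g. from a cron
# job that queries Arc and posts the result to a webhook.
def storage_over_threshold(rows, threshold=0.85):
    """rows: iterable of dicts with storage_name, node_name, used, total."""
    alerts = []
    for r in rows:
        if r["total"] and r["used"] / r["total"] > threshold:
            pct = 100 * r["used"] / r["total"]
            alerts.append(f'{r["storage_name"]} on {r["node_name"]}: {pct:.1f}% used')
    return alerts

rows = [
    {"storage_name": "local-zfs", "node_name": "pve1", "used": 900, "total": 1000},
    {"storage_name": "local-zfs", "node_name": "pve2", "used": 400, "total": 1000},
]
print(storage_over_threshold(rows))
# -> ['local-zfs on pve1: 90.0% used']
```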
Example alert query for storage:
SELECT
storage_name,
node_name,
(used / total) * 100 as usage_percent
FROM proxmox.proxmox_storage
WHERE time > NOW() - INTERVAL '5 minutes'
AND (used / total) > 0.85
GROUP BY storage_name, node_name, used, total

Advanced: Adding ZFS Metrics
If you're using ZFS on your Proxmox nodes (which you should be), you can add more detailed storage metrics. Run Telegraf directly on each Proxmox node with the zfs input plugin:
[[inputs.zfs]]
poolMetrics = true
datasetMetrics = true

This gives you:
- Pool health status
- Fragmentation percentage
- Dedup ratio
- ARC hit/miss rates
- Dataset-level usage
Advanced: Backup Monitoring
Monitor your Proxmox backups by parsing the backup logs. Add this to Telegraf:
[[inputs.exec]]
commands = ["/usr/local/bin/backup-status.sh"]
timeout = "30s"
data_format = "influx"
interval = "1h"

With a script that parses /var/log/vzdump/ to extract backup success/failure and duration.
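Here's one way that script could look, sketched in Python rather than shell. The "Finished Backup of VM …" log line pattern is an assumption based on typical vzdump output — check your own logs and adjust the regex:

```python
#!/usr/bin/env python3
# Sketch of a backup-status script: scan vzdump task logs and emit
# Influx line protocol for Telegraf's exec input.
# ASSUMPTION: logs contain lines like
#   "INFO: Finished Backup of VM 101 (00:02:31)"
import re
from pathlib import Path

FINISHED = re.compile(r"Finished Backup of VM (\d+) \((\d+):(\d+):(\d+)\)")

def parse_log(text: str):
    """Yield (vmid, duration_seconds) for each completed backup in a log."""
    for m in FINISHED.finditer(text):
        vmid = m.group(1)
        h, mins, s = (int(x) for x in m.group(2, 3, 4))
        yield vmid, h * 3600 + mins * 60 + s

def emit_lines(text: str):
    """Format results as Influx line protocol (measurement: proxmox_backup)."""
    return [
        f"proxmox_backup,vmid={vmid} duration_seconds={dur}i,success=1i"
        for vmid, dur in parse_log(text)
    ]

if __name__ == "__main__":
    log_dir = Path("/var/log/vzdump")
    if log_dir.is_dir():
        for path in log_dir.glob("*.log"):
            print("\n".join(emit_lines(path.read_text())))
```

A fuller version would also match failure lines and emit success=0i, but the shape is the same.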
Why Arc for Proxmox Monitoring?
A few reasons I prefer Arc over other time-series databases for this:
1. Compression - Proxmox generates a lot of metrics. Arc's Parquet storage compresses efficiently, so months of data stays manageable.
2. Fast queries - When you need to analyze trends over weeks or months, Arc's DuckDB engine handles it without breaking a sweat.
3. SQL - No proprietary query language. Standard SQL means your Grafana dashboards are easy to build and maintain.
4. Portable data - Your metrics are stored in Parquet files. If you ever need to analyze them with Pandas, DuckDB CLI, or any other tool, you can.
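That portability extends to the query API too: anything that speaks HTTP can pull metrics out of Arc. A minimal Python sketch of the same POST /api/v1/query call the curl check used earlier — URL and token are placeholders from this setup:

```python
# Query Arc's SQL endpoint from Python using only the standard library.
import json
import urllib.request

def arc_query_request(base_url: str, token: str, sql: str) -> urllib.request.Request:
    """Build the authenticated query request (split out so it's testable)."""
    return urllib.request.Request(
        f"{base_url}/api/v1/query",
        data=json.dumps({"sql": sql}).encode(),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )

def arc_query(base_url: str, token: str, sql: str):
    """POST the query and return the decoded JSON response."""
    req = arc_query_request(base_url, token, sql)
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)

# Usage, against a running stack:
#   rows = arc_query("http://localhost:8000", "your-arc-token",
#                    "SELECT * FROM proxmox.proxmox_node LIMIT 5")
#   print(rows)
```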
Resources
- Telegraf Proxmox Plugin: https://github.com/influxdata/telegraf/tree/master/plugins/inputs/proxmox
- Proxmox API Documentation: pve.proxmox.com/pve-docs/api-viewer
- Arc Documentation: docs.basekick.net/arc
- Grafana DuckDB Plugin: grafana.com/grafana/plugins/grafana-duckdb-datasource
Ready to handle billion-record workloads?
Deploy Arc in minutes. Own your data in Parquet.
