aakura/notes
essay · apr 28 · 2026

The cheapest observability stack I'd actually trust on call.

Grafana, Loki, Prometheus, and a single VictoriaMetrics box. Three months of tuning, an honest cost sheet, and the dashboards I wish someone had handed me on day one.

There’s a particular flavor of dread to opening a metrics bill for the first time. You know what’s coming — every label you got lazy with, every dashboard you forgot to retire, every panel that ships a thousand series for one user. The number is always larger than you expected, and the conversation that follows is always shaped the same way: someone proposes ripping out the whole stack and switching providers.

I’ve now had that conversation enough times to be tired of it, and I’ve spent the last quarter building the simplest, cheapest observability stack I’d happily defend in a room full of skeptics.

What “trust on call” actually means

When I say I trust a stack on call, I mean three boring things:

  • it tells me when something is wrong, with enough context that I can act in the first five minutes;
  • it stays up when the thing it’s measuring is also falling over;
  • it doesn’t bankrupt me in a quarter where traffic doubles.

That third one is the one that quietly kills most setups. You build something nice in a green field, traffic grows, the bill grows faster, and one Friday afternoon someone in finance asks a question you don’t want to answer.

The stack

I’ll keep this short. Four pieces, one VM, one S3 bucket, one Grafana.

  • Prometheus for short-term metrics, 24h retention, scraping everything that breathes.
  • VictoriaMetrics as the long-term store. Single binary, single VM, S3 backups. Boring.
  • Loki for logs, with a strict labels policy and a hard cap on cardinality.
  • Grafana as the single pane.

That’s it. No agent fleet, no managed tier, no per-event pricing. Total spend on the largest deployment I’ve touched: $84/month, of which $61 is the VM.
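The wiring between the first two pieces is mostly one stanza: Prometheus keeps its 24 hours of local data and forwards everything to VictoriaMetrics over remote write. A minimal sketch — the hostname and queue settings are illustrative assumptions, though 8428 is VictoriaMetrics’ default listen port and /api/v1/write its standard remote-write endpoint:

```yaml
# prometheus.yml — short-term scrape, long-term storage delegated to VictoriaMetrics.
global:
  scrape_interval: 15s

# The 24h local retention lives on the Prometheus command line, not here:
#   --storage.tsdb.retention.time=24h

remote_write:
  - url: http://victoriametrics.internal:8428/api/v1/write  # hostname is illustrative
    queue_config:
      max_samples_per_send: 10000
```

Dashboards and long-range queries point Grafana at VictoriaMetrics; only the last-24h panels hit Prometheus directly.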

What broke first

The cardinality policy. I shipped it on day one as a soft suggestion and watched, three weeks later, as a single misconfigured exporter blew it out by an order of magnitude. I now treat the cardinality budget as a hard cap, enforced by an alerting rule that pages me at 90% of the limit. It has already paid for itself twice.
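A sketch of that 90% page, built on Prometheus’s own head-series metric. The 1M budget and the severity label are assumptions — set the budget to whatever number you defended in the cost sheet:

```yaml
# cardinality-budget.yml — pages at 90% of an assumed 1M-series budget.
groups:
  - name: cardinality
    rules:
      - alert: CardinalityBudgetNearLimit
        # prometheus_tsdb_head_series is the count of active series in the head block.
        expr: prometheus_tsdb_head_series > 0.9 * 1000000
        for: 15m
        labels:
          severity: page  # routing label; match it to your pager in Alertmanager
        annotations:
          summary: "Active series over 90% of the 1M cardinality budget"
```

The `for: 15m` clause is deliberate: a deploy can legitimately spike the series count for a few minutes, and the page is for sustained breaches, not churn.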

Dashboards I’d actually use

The four dashboards I keep up at all times:

  1. Service is alive — black-box probe, latency histogram, error rate.
  2. Capacity headroom — CPU, memory, file descriptors, with thresholds.
  3. Deploy diff — pre/post deploy graphs side by side, autogenerated.
  4. Cost & cardinality — the budget, the burn rate, the ten loudest series.

Everything else is a derivative. If you can’t fit your truth onto four dashboards, you don’t have an observability problem — you have a clarity problem.
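Dashboard one leans entirely on an external probe. A sketch of the scrape job behind it, using the standard blackbox exporter probe/relabel pattern — the exporter address and target URL are illustrative, though 9115 is the exporter’s default port:

```yaml
# Scrape job feeding the "Service is alive" dashboard via the blackbox exporter.
scrape_configs:
  - job_name: blackbox
    metrics_path: /probe
    params:
      module: [http_2xx]        # expect an HTTP 2xx from each target
    static_configs:
      - targets:
          - https://example.internal/healthz   # illustrative probe target
    relabel_configs:
      # Hand the target URL to the exporter as the ?target= parameter...
      - source_labels: [__address__]
        target_label: __param_target
      # ...keep it as the instance label so panels read sensibly...
      - source_labels: [__param_target]
        target_label: instance
      # ...and actually scrape the exporter, not the target.
      - target_label: __address__
        replacement: blackbox-exporter.internal:9115
```

The resulting `probe_success` and `probe_duration_seconds` series are the whole top row of the dashboard.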


— Helsinki, April 2026