When your cluster goes quiet, listen harder.
Silence on a control plane is rarely peace. Three patterns I've learned to watch for — and the lightweight probes I now keep running on every cluster I touch.
The first time a production cluster I owned went truly quiet, I assumed I’d finally fixed something. Pages had stopped, dashboards were green, the deploy from the morning had settled. I went to lunch. By the time I got back, the silence had a shape.
What I’d actually done was lobotomize the alerting. A bad relabel rule, shipped two weeks earlier and rolled out gradually, had been quietly moving more and more series into a metric name nothing was scraping. The cluster wasn’t healthy. It just had no way left to tell me otherwise. I learned, over the next ninety minutes, that there are at least three flavors of quiet, and only one of them is the kind you want.
A particular silence
You can describe a system as healthy in two very different ways. One is positive — a chain of evidence that things are doing what they should. The other is negative — an absence of evidence that anything is wrong. Most teams I’ve worked with optimize ruthlessly for the second and call it the first.
The trouble is that absence is easy to manufacture. A page that didn’t fire could mean nothing went wrong, or it could mean the path between the wrongness and the page is broken. From inside the on-call rotation those two states feel identical. Outside, of course, only one of them lasts.
The most dangerous on-call shifts I’ve ever had are the ones where nothing happens for too long.
Three patterns I watch
1. The drift in cardinality
Healthy metrics pipelines tend to grow in cardinality slowly and lumpily — you ship a new feature, it adds a label, a thousand series appear, things stabilize. Unhealthy ones grow smoothly. A perfectly straight line of new series, day over day, almost always means a relabel rule has gone soft and is generating noise that someone, somewhere, is paying for.
The probe is trivial: a recording rule that counts unique series per metric name, plotted as a daily delta. The interesting value isn't "high"; it's "stable." Anything growing by exactly the same amount every day is a bug pretending to be a trend.
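Here is that "stable" test in alert form, as a sketch: it leans on the daily-delta recording rule defined under "The probes I leave running" below, and the alert name, one-series tolerance, and seven-day window are all illustrative rather than tuned.

# Fires when a metric's series count has been growing by a suspiciously
# constant amount for a week straight. Thresholds are illustrative.
- alert: CardinalityGrowthTooSmooth
  expr: |
    stddev_over_time(aakura:cardinality_per_metric_daily_delta[7d]) < 1
      and
    min_over_time(aakura:cardinality_per_metric_daily_delta[7d]) > 0
  for: 1d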
2. The deploy that didn’t change anything
I have a soft rule: a deploy that doesn’t move at least one production graph is a deploy I should be suspicious of. Even tiny changes — a logging level, a renamed binary — should ripple somewhere. When they don’t, the cause is usually one of three things, none of them good:
- the change didn’t actually land where you think it did;
- the surface you’re watching has gone numb;
- the thing you changed has been bypassed for so long no one noticed.
A simple check, diffing a small bouquet of post-deploy graphs against the ten-minute window before the deploy, has caught all three for me, usually within a quarter of adopting it.
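For a single graph, the check is one ad-hoc PromQL query, run roughly ten minutes after the deploy lands. A minimal sketch, where http_requests_total is a placeholder metric and the 0.1% tolerance is arbitrary:

# Returns a result only when the ten minutes after the deploy are
# suspiciously identical to the ten minutes before it. The 20m offset
# skips over the deploy window itself.
abs(
  sum(rate(http_requests_total[10m]))
    / sum(rate(http_requests_total[10m] offset 20m))
  - 1
) < 0.001

In practice I run this over a handful of graphs per service and treat any query that returns a value as a reason to go look.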
A deploy that doesn’t move at least one graph is a deploy I should be suspicious of.
3. The cluster that’s getting cheaper for no reason
Cost is, oddly, one of the cleanest health signals. A real production system, in real use, tends to cost more this quarter than last — slowly, predictably. When the bill drops without a corresponding effort to make it drop, the most likely explanation is that some part of the system has stopped doing work. That can be benign. Often it isn’t.
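The probe version uses a stand-in for the bill itself. A sketch, assuming kube-state-metrics is installed; the 20% threshold and six-hour hold are arbitrary:

# Fires when the cluster's total requested CPU drops more than 20%
# week over week. Deliberate shrink efforts will trip it too; for a
# probe, that false positive is acceptable.
- alert: ClusterGettingCheaperForNoReason
  expr: |
    sum(kube_pod_container_resource_requests{resource="cpu"})
      < 0.8 * sum(kube_pod_container_resource_requests{resource="cpu"} offset 1w)
  for: 6h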
The probes I leave running
The point of these probes isn’t to be clever. It’s to be cheap, boring, and impossible to silence accidentally. Each one fits in a single Prometheus rule file and a single Grafana row. Together they cost me, on the largest cluster I run, about $11/month.
groups:
  - name: aakura-silence-probes
    rules:
      # Unique series per metric name. label_replace copies the metric
      # name into a plain "metric" label so the recording rule can keep
      # it (a rule would otherwise overwrite __name__).
      - record: aakura:cardinality_per_metric
        expr: |
          label_replace(
            count by (__name__) ({__name__=~".+"}),
            "metric", "$1", "__name__", "(.+)"
          )
      # Day-over-day change in that count. A delta that is identical
      # every day is the signal worth investigating.
      - record: aakura:cardinality_per_metric_daily_delta
        expr: |
          aakura:cardinality_per_metric
            - aakura:cardinality_per_metric offset 1d
I keep a second one for byte-rate per ingress, and a third for “deploys that didn’t move anything,” which is a bit more involved and I’ll write up separately. Together they form a sort of negative space dashboard: not pictures of health, but pictures of where health stopped reporting.
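For reference, the byte-rate probe is the same shape. A sketch, assuming the ingress-nginx controller's response-size histogram; the metric name may differ in your setup:

# Bytes served per second, per ingress. A rate that quietly collapses
# toward zero is another flavor of unreported silence.
- record: aakura:ingress_bytes_per_second
  expr: |
    sum by (ingress) (
      rate(nginx_ingress_controller_response_size_sum[5m])
    )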
A note on cost
Every time I write something like this, someone asks whether the probes “pay for themselves.” This is the wrong question. The probes don’t catch incidents. They catch the quiet between incidents — the stretches when the system is degrading toward a much louder failure six months out. They’re insurance against being lulled, and the value of insurance is hard to measure until you’ve gone without it.
The honest answer is that I don’t know whether mine have paid for themselves. I do know that I’ve never, in the four years I’ve kept them, been surprised by a quiet on-call shift. That, for me, has been worth eleven dollars a month.
Closing
The systems we build are loud by design — telemetry, traces, logs, alerts, all calibrated to wake us up at 3 a.m. The trouble is that calibration drifts, and the systems we trust most are the ones we monitor least. A small habit of suspicion toward silence, codified into three or four cheap probes, has done more for my sleep than any single tool I’ve adopted.
If your cluster goes quiet this week, take a moment before you congratulate yourself. Listen harder. The peace you’re feeling might be real. Or it might just be that nothing is left in the system that knows how to scream.
— Helsinki, May 2026