The cheapest observability stack I'd actually trust on call
Grafana, Loki, Prometheus, and a single VictoriaMetrics box. Three months of tuning, an honest cost sheet, and the dashboards I wish someone had handed me on day one.
Field reports on the systems I keep alive — Kubernetes that won't behave, observability that finally tells the truth, and the small rituals that turn 3 a.m. pages into something almost peaceful.
Silence on a control plane is rarely peace. Three patterns I've learned to watch for — and the lightweight probes I now keep running on every cluster I touch.
Grafana, Loki, Prometheus, and a single VictoriaMetrics box. Three months of tuning, an honest cost sheet, and the dashboards I wish someone had handed me on day one.
Runbooks are how teams lie to themselves about preparedness. Here's the format that actually got opened during incidents — and the three sections I deleted from every template I had.
A long honest look at every shared module I've shipped over four years, what broke, what calcified, and the small ergonomic shifts I'd make if I started again tomorrow.
The strongest postmortems I've read all share a structure that has nothing to do with the templates we keep handing each other. Here's what they share — and a worked example.
Yes, even now. Even with Go, even with Python, even with the LLM in your editor. Three places I keep reaching for shell, and the one rule I never break.
A retrospective on a six-week database move that became a four-month one, and what I learned about scope, splash damage, and the very specific kind of optimism that makes engineers dangerous.
Five minutes, four questions, one rule. The simplest piece of process I've ever introduced and somehow the one I'm most quietly proud of.