by geofft 2303 days ago

> I can’t tell you how many times I caught an issue because I knew our metrics backwards and forwards, but it didn’t trip an alert threshold.

So how many times was an issue missed because you weren't in the office, or because you were looking at your own screen and not dashboards at the moment?

Humans are incredibly powerful, but our whole job as SREs is to make things reliable, repeatable, and scalable. We're doing an industry-wide migration from elegantly hand-crafted LAMP stacks running SSH to Kubernetes and infrastructure-as-code, not because you can't fix problems with SSH (you can, and you can usually fix them faster and better) but because you can't scalably fix problems with SSH. Similarly, if a human found an issue and an alert didn't trip, I'd count that as a bug/missing feature in the monitoring.

It's valuable while you're still small and working out your monitoring to keep a human in the loop - but at some point you need to get rid of that single point of failure. By all means, rely on a human to figure out where your alerting is lacking (just like you rely on a human to write the infrastructure-as-code), but you should eventually not rely on human intervention to actually keep incidents from happening.

_jal 2303 days ago

You're both right.

Instrumentation and alerts are vital - they leverage inhuman persistence, patience and low cost. But alerts do not substitute for a deep understanding of how your systems work.

A number of our more useful "pre-crime" alerts derive from exactly that - if I hadn't been elbow-deep in our systems long enough to notice that certain behaviors have non-obvious second- and third-order effects downstream, we wouldn't have those alerts at all.

geofft 2303 days ago

So, I'm making a bit of a subtle claim - you should absolutely be elbow-deep in your systems, and you should be understanding things well enough to build these sorts of proactive alerts, but you shouldn't rely on people being elbow-deep for noticing problems in real time.

If you're ever at the point where you catch a problem and automated monitoring didn't, that's a bug in automated monitoring. If you are really good at finding new bugs in automated monitoring and more things to monitor because you're spending your time getting a sense of how the system behaves, that's fantastic, keep doing that. (That is one of the good reasons for dashboards IMO - a bunch of data to look at when you've already realized something's wrong. Just don't use dashboards to make the decision that something must be wrong.) If you don't improve your automated monitoring and you're worried things will start failing without humans watching dashboards, then you're not solving your existing bugs.

_jal 2302 days ago

> but you shouldn't rely on people being elbow-deep for noticing problems in real time.

I completely and unreservedly agree.

> that's a bug in automated monitoring

As part of incident review, we explicitly added a "review monitor performance" step. My favorite part is that the number of times monitors are created, adjusted or complained about post-incident is in itself a highly useful datapoint.

reaperducer 2303 days ago

> So how many times was an issue missed because you weren't in the office, or because you were looking at your own screen and not dashboards at the moment?

That's not a problem with dashboards. That's a problem with training and staffing people.

> because you can't scalably fix problems with SSH.

The number of businesses that need to worry about scalability is vanishingly small compared to the number of businesses that don't. Let's not pretend that one company's problems are the same as another's.

> you should eventually not rely on human intervention to actually keep incidents from happening.

He didn't state that the dashboard was the only way his organization kept tabs on things. He indicated that it was only one way, and specifically stated that an alert system also exists.

tyrust 2303 days ago

> That's not a problem with dashboards. That's a problem with training and staffing people.

Training and staffing people to look at dashboards? I've never heard of this and it sounds awful.

reaperducer 2303 days ago

"Hey, Mike. On your way to the Keurig, remember to glance at the status panel on the wall and let us know if something doesn't look right, OK?"

Brutal.

InvisibleCities 2303 days ago

Why should Mike have to remember this? Why should all of your infrastructure depend on Mike not getting a text from his wife while walking to the fridge for a La Croix?

tyrust 2302 days ago

I read your comment as sincerely saying that such an arrangement would be "brutal". Looking at your downvotes maybe people think you were being sarcastic?

geofft 2303 days ago

> That's not a problem with dashboards. That's a problem with training and staffing people.

Again, the whole point of us being computer people is that we think computers can solve problems in repeatable, reliable ways. You can run a highly reliable, say, delivery-based bookstore by having a well-staffed group of well-trained human phone operators who pass messages on to human shippers. People did that (and they still do), and it worked. But we have the thesis that you can do this more efficiently and more reliably - in short, that you can deliver more business value - by using computers to automate the process.

> The number of businesses that need to worry about scalability is vanishingly small compared to the number of businesses that don't. Let's not pretend that one company's problems are the same as another's.

I do fully agree that different companies have different priorities, and in particular I think it's totally fine to rely on humans in the loop while a system is still young (or has just been redesigned) and you don't have a good codified sense of how it behaves yet. However,

1) Wall-based dashboards aren't a best practice, any more than SSHing to production servers is a best practice. It's the right thing for some cases, some of the time. I'd agree with "It's a valuable skill, and it's been useful;" I disagree with "It's so valuable you should make sure everyone does it." If you have the option of either getting good at alerts or getting good at dashboards, spend your time getting good at alerts, first. I'd say the same about infrastructure-as-code vs. SSH-to-prod (and I say this as someone who regularly SSHs to prod and is real good at single-machine old-school sysadminnery).

2) Scalability isn't about absolute size, it's about how much you can do with the resources you have. Small teams and not-yet-profitable teams need to focus more on scalability (in the sense I'm using it) because they simply can't staff enough people to cover up gaps in operability. For example, you're much better off figuring out how to set up HA and automated failover than saying "We're too small for that," setting up a weekly pager rotation with people on call 24 hours a day, and alerting them so much they can't do non-toil work (or worse, burning them out and having them find another job).

Many years ago I was on a ~4-person team at my undergrad computer club running web hosting. We ended up getting popular enough that many real university applications (course websites for submitting assignments, etc.) depended on us. Our priority was that, as students, we couldn't get paged during finals week because our academics would take priority, and yet finals week was the most critical time for the service to stay up. So we got real good at HA, at reproducible deployments and config management, etc. (I remember one time we spun up a new server during finals week - and we didn't have to do any fiddling to add it to the cluster precisely because we'd automated the provisioning process.) We had web pages with graphed metrics to inform our capacity planning, but no dashboards that anyone was expected to stare at, just alerts on full outages.

pjmorris 2303 days ago

> Similarly, if a human found an issue and an alert didn't trip, I'd count that as a bug/missing feature in the monitoring.

The way I took the GP's point was that humans can find things that haven't yet been automated, while automation can't (at least not yet; I'd argue it'll take AGI for that).

geofft 2303 days ago

Yes, I agree with this. But if you're relying on humans to look at dashboards to keep your actual service up in the moment, you're not seriously committing to automating (just like if you SSH to every machine you Terraform to tweak things, you're not really committed to Terraform).

What you should do is rely on automation to detect problems and alert people, and in postmortems, look at graphs and have humans say things like "Hey, this queue kept steadily climbing for three hours before the outage" or "We would have noticed it in this metric, but it's so noisy that we can't alert on it" or something. Then you can write more automation (or focus on some prerequisite dev work).
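To make the "queue kept steadily climbing" case concrete, here is a minimal sketch of turning that postmortem observation into an automated check. Everything in it is an assumption for illustration: the function names `moving_average` and `sustained_climb`, the parameters, and the premise that samples arrive at evenly spaced intervals. It is not tied to any particular monitoring tool, and a real rule would need tuning against your own data.

```python
from statistics import mean


def moving_average(values, width):
    """Trailing moving average; damps noise so a trend is easier to see."""
    if width < 1:
        raise ValueError("width must be >= 1")
    return [mean(values[max(0, i - width + 1): i + 1]) for i in range(len(values))]


def sustained_climb(values, window, width=3, min_rise=0.0):
    """Hypothetical alert condition (an assumption, not a standard rule).

    Returns True when the smoothed series rose at every step across the
    last `window` samples and gained more than `min_rise` overall.
    """
    if window < 2 or len(values) < window:
        return False  # not enough history to judge a trend
    smoothed = moving_average(values, width)[-window:]
    rising = all(b > a for a, b in zip(smoothed, smoothed[1:]))
    return rising and (smoothed[-1] - smoothed[0]) > min_rise


# Made-up queue depths: noisy, but trending upward.
noisy_queue = [10, 12, 11, 14, 13, 16, 15, 18, 17, 20, 19, 22]
# Made-up queue depths: noisy, with no trend.
flat_queue = [10, 11, 10, 11, 10, 11, 10, 11, 10, 11, 10, 11]

if sustained_climb(noisy_queue, window=6, width=3, min_rise=5):
    print("ALERT: queue depth has been climbing steadily")
```

The smoothing step is what addresses the "too noisy to alert on" complaint: with `width=1` (no smoothing) the same upward-trending series does not look monotonic, so the check stays quiet, while a small averaging window exposes the trend.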

kqr 2303 days ago

I don't think anyone is arguing that, though. Lots of things humans notice, e.g. "we speculatively upped the virtual file system cache and now the service has worse throughput but better high-nines response time", are not something you can really build an alert for, nor something you really want an alert for -- but they are absolutely something that would show up on a dashboard you're intimate with.

In other words, people are not arguing for replacing alerts with humans, but rather that continuously looking at your metrics gives you a mental model of how your system's behaviour changes in response to changes in configuration, whether intentional or not.
