by wjossey 2303 days ago
Strongly disagree.

Understanding your metrics is a key part of so many roles, from devops, to product teams, to marketers...

Yes, you should be automating alerts whenever possible. Yes, you should be putting up key metrics in a visible place so everyone can see how the product is performing.

I can’t tell you how many times I caught an issue because I knew our metrics backwards and forwards, but it didn’t trip an alert threshold. Not every issue follows a pattern easily defined in a check, and human brains are incredible computers capable of helping to fill in that gap.


> I can’t tell you how many times I caught an issue because I knew our metrics backwards and forwards, but it didn’t trip an alert threshold.

So how many times was an issue missed because you weren't in the office, or because you were looking at your own screen and not dashboards at the moment?

Humans are incredibly powerful, but our whole job as SREs is to make things reliable, repeatable, and scalable. We're doing an industry-wide migration from elegantly hand-crafted LAMP stacks running SSH to Kubernetes and infrastructure-as-code, not because you can't fix problems with SSH (you can, and you can usually fix them faster and better) but because you can't scalably fix problems with SSH. Similarly, if a human found an issue and an alert didn't trip, I'd count that as a bug/missing feature in the monitoring.

It's valuable while you're still small and working out your monitoring to keep a human in the loop - but at some point you need to get rid of that single point of failure. By all means, rely on a human to figure out where your alerting is lacking (just like you rely on a human to write the infrastructure-as-code), but you should eventually not rely on human intervention to actually keep incidents from happening.

You're both right.

Instrumentation and alerts are vital - they leverage inhuman persistence, patience and low cost. But alerts do not substitute for a deep understanding of how your systems work.

A number of the more useful "pre-crime" alerts we have were derived from that - if I hadn't been elbow-deep in our systems long enough to notice that certain behaviors have non-obvious second- and third-order effects downstream, we wouldn't have the alerts at all.

So, I'm making a bit of a subtle claim - you should absolutely be elbow-deep in your systems, and you should be understanding things well enough to build these sorts of proactive alerts, but you shouldn't rely on people being elbow-deep for noticing problems in real time.

If you're ever at the point where you catch a problem and automated monitoring didn't, that's a bug in automated monitoring. If you are really good at finding new bugs in automated monitoring and more things to monitor because you're spending your time getting a sense of how the system behaves, that's fantastic, keep doing that. (That is one of the good reasons for dashboards IMO - a bunch of data to look at when you've already realized something's wrong. Just don't use dashboards to make the decision that something must be wrong.) If you don't improve your automated monitoring and you're worried things will start failing without humans watching dashboards, then you're not solving your existing bugs.

> but you shouldn't rely on people being elbow-deep for noticing problems in real time.

I completely and unreservedly agree.

> that's a bug in automated monitoring

As part of incident review, we explicitly added a "review monitor performance" step. My favorite part is that the number of times monitors are created, adjusted or complained about post-incident is in itself a highly useful datapoint.
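
A minimal sketch of how that datapoint could be tallied, assuming each incident review logs which monitors were touched and how. The record format, incident IDs, monitor names, and the `monitor_churn` helper are all hypothetical, made up for illustration rather than taken from any real tool:

```python
from collections import Counter

# Hypothetical review records: (incident_id, monitor_name, action).
reviews = [
    ("INC-1", "queue_depth", "created"),
    ("INC-1", "api_latency", "adjusted"),
    ("INC-2", "api_latency", "complained"),
    ("INC-3", "api_latency", "adjusted"),
    ("INC-3", "disk_full", "created"),
]


def monitor_churn(records):
    """Count post-incident actions per monitor and per action type."""
    per_monitor = Counter(name for _, name, _ in records)
    per_action = Counter(action for _, _, action in records)
    return per_monitor, per_action


per_monitor, per_action = monitor_churn(reviews)
# Monitors that keep showing up in reviews are candidates for a redesign.
noisiest = per_monitor.most_common(1)
```

A monitor that is adjusted or complained about after every incident is probably measuring the wrong thing, which is the kind of signal the review step is meant to surface.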

> So how many times was an issue missed because you weren't in the office, or because you were looking at your own screen and not dashboards at the moment?

That's not a problem with dashboards. That's a problem with training and staffing people.

> because you can't scalably fix problems with SSH.

The number of businesses that need to worry about scalability is vanishingly small compared to the number of businesses that don't. Let's not pretend that one company's problems are the same as another's.

> you should eventually not rely on human intervention to actually keep incidents from happening.

He didn't state that the dashboard was the only way his organization kept tabs on things. He indicated that it was only one way, and specifically stated that an alert system also exists.

> That's not a problem with dashboards. That's a problem with training and staffing people.

Training and staffing people to look at dashboards? I've never heard of this and it sounds awful.

"Hey, Mike. On your way to the Keurig, remember to glance at the status panel on the wall and let us know if something doesn't look right, OK?"

Brutal.

Why should Mike have to remember this? Why should all of your infrastructure depend on Mike not getting a text from his wife while walking to the fridge for a La Croix?

I read your comment as sincerely saying that such an arrangement would be "brutal". Looking at your downvotes, maybe people think you were being sarcastic?

> That's not a problem with dashboards. That's a problem with training and staffing people.

Again, the whole point of us being computer people is that we think computers can solve problems in repeated, reliable ways. You can run a highly reliable, say, delivery-based bookstore by having a well-staffed group of well-trained human phone operators who pass messages on to human shippers. People did that (and they still do), and it worked. But we have the thesis that you can do this more efficiently and more reliably - in short, that you can deliver more business value - by using computers to automate the process.

> The number of businesses that need to worry about scalability is vanishingly small compared to the number of businesses that don't. Let's not pretend that one company's problems are the same as another's.

I do fully agree that different companies have different priorities, and in particular I think it's totally fine to rely on humans in the loop while a system is still young (or has just been redesigned) and you don't have a good codified sense of how it behaves yet. However,

1) Wall-based dashboards aren't a best practice, any more than SSHing to production servers is a best practice. It's the right thing for some cases, some of the time. I'd agree with "It's a valuable skill, and it's been useful;" I disagree with "It's so valuable you should make sure everyone does it." If you have the option of either getting good at alerts or getting good at dashboards, spend your time getting good at alerts, first. I'd say the same about infrastructure-as-code vs. SSH-to-prod (and I say this as someone who regularly SSHs to prod and is real good at single-machine old-school sysadminnery).

2) Scalability isn't about absolute size, it's about how much you can do with the resources you have. Small teams and not-yet-profitable teams need to focus more on scalability (in the sense I'm using it) because they simply can't staff enough people to cover up gaps in operability. For example, you're much better off figuring out how to set up HA and automated failover than saying "We're too small for that," setting up a weekly pager rotation with people on call 24 hours a day, and alerting them so much they can't do non-toil work (or worse, burning them out and having them find another job).

Many years ago I was on a ~4-person team at my undergrad computer club running web hosting. We ended up getting popular enough that many real university applications (course websites for submitting assignments, etc.) depended on us. Our priority was that, as students, we couldn't get paged during finals week because our academics would take priority, and yet finals week was the most critical time for the service to stay up. So we got real good at HA, at reproducible deployments and config management, etc. (I remember one time we spun up a new server during finals week - and we didn't have to do any fiddling to add it to the cluster precisely because we'd automated the provisioning process.) We had web pages with graphed metrics to inform our capacity planning, but no dashboards that anyone was expected to stare at, just alerts on full outages.

> Similarly, if a human found an issue and an alert didn't trip, I'd count that as a bug/missing feature in the monitoring.

The way that I took the GP's point was that humans can find things that haven't yet been automated, while automation can't (at least not yet, but I'd argue it'll take AGI for that.)

Yes, I agree with this. But if you're relying on humans to look at dashboards to keep your actual service up in the moment, you're not seriously committing to automating (just like if you SSH to every machine you Terraform to tweak things, you're not really committed to Terraform).

What you should do is rely on automation to detect problems and alert people, and in postmortems, look at graphs and have humans say things like "Hey, this queue kept steadily climbing for three hours before the outage" or "We would have noticed it in this metric, but it's so noisy that we can't alert on it" or something. Then you can write more automation (or focus on some prerequisite dev work).
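
As a rough sketch of what "write more automation" could look like for the climbing-queue case: fit a trend line to recent queue-depth samples and flag a sustained upward slope, instead of waiting for an absolute threshold. The function names, the one-sample-per-minute assumption, and the default thresholds are all made up for illustration; a real setup would more likely express this in the monitoring system's own rule language.

```python
from typing import Sequence


def slope_per_minute(samples: Sequence[float], interval_minutes: float = 1.0) -> float:
    """Least-squares slope of evenly spaced samples, in units per minute."""
    n = len(samples)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den / interval_minutes


def queue_is_climbing(
    samples: Sequence[float],
    interval_minutes: float = 1.0,
    min_slope: float = 1.0,            # assumed: items per minute worth alerting on
    min_window_minutes: float = 30.0,  # assumed: ignore windows shorter than this
) -> bool:
    """True if the window is long enough and the fitted trend is steep enough."""
    window = (len(samples) - 1) * interval_minutes
    if window < min_window_minutes:
        return False
    return slope_per_minute(samples, interval_minutes) >= min_slope


# A queue that grows by 2 items per minute for an hour.
steady_climb = [100 + 2 * i for i in range(61)]
alert = queue_is_climbing(steady_climb)
```

Fitting a slope over a window rather than comparing single points is one way to cope with the "too noisy to alert on" problem: sample-to-sample jitter mostly cancels out, while a steady climb does not.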

I don't think anyone is arguing that, though. Lots of things humans notice, e.g. "we speculatively upped the virtual file system cache and now the service has worse throughput but better high nines response time", are not things you can really build an alert for, nor things you really want an alert for -- but they are absolutely things that would show up on a dashboard you're intimate with.

In other words, people are not arguing for replacing alerts with humans, but rather arguing that continuously looking at your metrics gives you a mental model of how your system's behaviour changes in response to changes in configuration, whether intentional or not.

Strongly agree (with you).

From the very first formulation of Ubiquitous Computing, the idea of a calmer and more environmentally integrated way of displaying information has held intuitive appeal. Weiser called this “calm computing”.. When information can be conveyed via calm changes in the environment, users are more able to focus on their primary work tasks while staying aware of non-critical information that affects them. Research in this sub-domain goes by various names including “ambient displays”, “peripheral displays”, and “notification systems”...

A Taxonomy of Ambient Information Systems: Four Patterns of Design

https://www.cc.gatech.edu/~john.stasko/papers/avi06.pdf

An automated email is OK, but visually seeing a graph flatline or a monitor turn red is much more likely to get noticed.

If you ignore alerting, then it's likely that your alerts are too noisy. See "alert fatigue".

It's that nice Mr Google's alerts