by reaperducer 2293 days ago
> So how many times was an issue missed because you weren't in the office, or because you were looking at your own screen and not dashboards at the moment?

That's not a problem with dashboards. That's a problem with training and staffing people.

> because you can't scalably fix problems with SSH.

The number of businesses that need to worry about scalability is vanishingly small compared to the number of businesses that don't. Let's not pretend that one company's problems are the same as another's.

> you should eventually not rely on human intervention to actually keep incidents from happening.

He didn't state that the dashboard was the only way his organization kept tabs on things. He indicated that it was only one way, and specifically stated that an alert system also exists.

2 comments

>That's not a problem with dashboards. That's a problem with training and staffing people.

Training and staffing people to look at dashboards? I've never heard of this and it sounds awful.

"Hey, Mike. On your way to the Keurig, remember to glance at the status panel on the wall and let us know if something doesn't look right, OK?"

Brutal.

Why should Mike have to remember this? Why should all of your infrastructure depend on Mike not getting a text from his wife while walking to the fridge for a La Croix?

I read your comment as sincerely saying that such an arrangement would be "brutal". Looking at your downvotes, maybe people think you were being sarcastic?

> That's not a problem with dashboards. That's a problem with training and staffing people.

Again, the whole point of us being computer people is that we think computers can solve problems in repeated, reliable ways. You can run a highly reliable, say, delivery-based bookstore by having a well-staffed group of well-trained human phone operators who pass messages on to human shippers. People did that (and they still do), and it worked. But we have the thesis that you can do this more efficiently and more reliably - in short, that you can deliver more business value - by using computers to automate the process.

> The number of businesses that need to worry about scalability is vanishingly small compared to the number of businesses that don't. Let's not pretend that one company's problems are the same as another's.

I do fully agree that different companies have different priorities, and in particular I think it's totally fine to rely on humans in the loop while a system is still young (or has just been redesigned) and you don't have a good codified sense of how it behaves yet. However,

1) Wall-based dashboards aren't a best practice, any more than SSHing to production servers is a best practice. It's the right thing for some cases, some of the time. I'd agree with "It's a valuable skill, and it's been useful;" I disagree with "It's so valuable you should make sure everyone does it." If you have the option of either getting good at alerts or getting good at dashboards, spend your time getting good at alerts, first. I'd say the same about infrastructure-as-code vs. SSH-to-prod (and I say this as someone who regularly SSHs to prod and is real good at single-machine old-school sysadminnery).

2) Scalability isn't about absolute size, it's about how much you can do with the resources you have. Small teams and not-yet-profitable teams need to focus more on scalability (in the sense I'm using it) because they simply can't staff enough people to cover up gaps in operability. For example, you're much better off figuring out how to set up HA and automated failover than saying "We're too small for that," setting up a weekly pager rotation with people on call 24 hours a day, and alerting them so much they can't do non-toil work (or worse, burning them out and having them find another job).

Many years ago I was on a ~4-person team at my undergrad computer club running web hosting. We ended up getting popular enough that many real university applications (course websites for submitting assignments, etc.) depended on us. Our priority was that, as students, we couldn't get paged during finals week because our academics would take priority, and yet finals week was the most critical time for the service to stay up. So we got real good at HA, at reproducible deployments and config management, etc. (I remember one time we spun up a new server during finals week - and we didn't have to do any fiddling to add it to the cluster precisely because we'd automated the provisioning process.) We had web pages with graphed metrics to inform our capacity planning, but no dashboards that anyone was expected to stare at, just alerts on full outages.
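
To make "just alerts on full outages" concrete, here is a minimal Python sketch. It is an illustration, not the setup described above: the names `evaluate`, `maybe_page`, and `Verdict` are hypothetical, the health checks are plain callables standing in for whatever probe you actually use, and `page` is whatever function notifies a human. The idea is that a single unhealthy backend in an HA cluster pages nobody; only zero healthy backends does.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Verdict:
    """Summary of one round of health checks (hypothetical type)."""
    healthy: int
    total: int

    @property
    def full_outage(self) -> bool:
        # A full outage means there are backends and none of them is healthy.
        return self.total > 0 and self.healthy == 0


def evaluate(checks: Dict[str, Callable[[], bool]]) -> Verdict:
    """Run every health check once and count how many backends are healthy."""
    healthy = 0
    for check in checks.values():
        try:
            if check():
                healthy += 1
        except Exception:
            # A probe that raises is treated as an unhealthy backend.
            pass
    return Verdict(healthy=healthy, total=len(checks))


def maybe_page(verdict: Verdict, page: Callable[[str], None]) -> bool:
    """Notify a human only on a full outage; return True if a page was sent."""
    if verdict.full_outage:
        page(f"full outage: 0/{verdict.total} backends healthy")
        return True
    return False
```

Run it from a scheduler of your choice; partial failures can go to a graph or a log for capacity planning instead of to anyone's phone.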