by geofft 2339 days ago

> That's not a problem with dashboards. That's a problem with training and staffing people.

Again, the whole point of us being computer people is that we think computers can solve problems in repeated, reliable ways. You can run a highly reliable, say, delivery-based bookstore by having a well-staffed group of well-trained human phone operators who pass messages on to human shippers. People did that (and they still do), and it worked. But we have the thesis that you can do this more efficiently and more reliably - in short, that you can deliver more business value - by using computers to automate the process.

> The number of businesses that need to worry about scalability is vanishingly small compared to the number of businesses that don't. Let's not pretend that one company's problems are the same as another's.

I do fully agree that different companies have different priorities, and in particular I think it's totally fine to rely on humans in the loop while a system is still young (or has just been redesigned) and you don't have a good codified sense of how it behaves yet. However,

1) Wall-based dashboards aren't a best practice, any more than SSHing to production servers is a best practice. It's the right thing for some cases, some of the time. I'd agree with "It's a valuable skill, and it's been useful"; I disagree with "It's so valuable you should make sure everyone does it." If you have the option of either getting good at alerts or getting good at dashboards, spend your time getting good at alerts first. I'd say the same about infrastructure-as-code vs. SSH-to-prod (and I say this as someone who regularly SSHs to prod and is real good at single-machine old-school sysadminnery).

2) Scalability isn't about absolute size; it's about how much you can do with the resources you have. Small teams and not-yet-profitable teams need to focus more on scalability (in the sense I'm using it) because they simply can't staff enough people to cover up gaps in operability. For example, you're much better off figuring out how to set up HA and automated failover than saying "We're too small for that," setting up a weekly pager rotation with people on call 24 hours a day, and alerting them so much they can't do non-toil work (or worse, burning them out and having them find another job).

Many years ago I was on a ~4-person team at my undergrad computer club running web hosting. We ended up getting popular enough that many real university applications (course websites for submitting assignments, etc.) depended on us. Our priority was that, as students, we couldn't get paged during finals week because our academics would take priority, and yet finals week was the most critical time for the service to stay up. So we got real good at HA, at reproducible deployments and config management, etc. (I remember one time we spun up a new server during finals week - and we didn't have to do any fiddling to add it to the cluster precisely because we'd automated the provisioning process.) We had web pages with graphed metrics to inform our capacity planning, but no dashboards that anyone was expected to stare at, just alerts on full outages.
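
To make "just alerts on full outages" concrete, here is a minimal sketch of what such a rule might look like in Python. Everything in it is hypothetical and for illustration only: the `Backend` type, the `should_page` and `check` helpers, and the `page` callback are assumptions, not a description of what the setup above actually ran.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Backend:
    """Hypothetical record for one server in an HA cluster."""
    name: str
    healthy: bool


def should_page(backends: Sequence[Backend]) -> bool:
    # Page only when no backend is healthy. A single failed node is
    # assumed to be absorbed by automated failover and is not worth
    # waking anyone up for. An empty cluster also counts as an outage.
    return not any(b.healthy for b in backends)


def check(backends: Sequence[Backend], page: Callable[[str], None]) -> bool:
    """Run one evaluation; call `page` (a hypothetical notifier) on full outage."""
    if should_page(backends):
        page(f"full outage: 0 of {len(backends)} backends healthy")
        return True
    return False
```

The point of the sketch is that the condition a human would otherwise watch for on a wall is written down once and evaluated by a machine; partial failures stay visible in the graphs but don't interrupt anyone.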