Hacker News
by Nextgrid 1604 days ago
Is downtime an actual, business-killing problem in practice? In my experience, very rarely so. Clouds also have downtime (over which you have much less control), and we seem to live with it just fine.

> when you need a 3 hours downtime on prod because you need to reboot and reconfigure your services

What about when your RDS instance fails and then gets stuck on "modifying" for an indefinite period of time? In the case I saw, it ended up being 12 hours, and I suspect an AWS engineer eventually did a manual operation to fix it. In the meantime, you have to restore from a backup and manually rebuild the missing data from other sources, such as logs, just to get back online. I've seen it happen and would've much preferred having the option to SSH in and recover it manually.

Scaling is less of a problem when bare-metal is so cheap that you can significantly overprovision and never have to worry about autoscaling. This also means you need far fewer moving parts that can break and take your service down.

I'm not saying that the cloud is always bad, but a hybrid approach would be the most pragmatic choice. For raw compute and bandwidth, bare-metal is orders of magnitude cheaper. You can still use the cloud's managed services from your bare-metal servers if you need them, though given how cheap bare-metal is, you may realize that you no longer need a lot of them.

When it comes to management/sysadmin work, every shop that uses the cloud beyond very small projects that are fully on a PaaS such as Heroku has a dedicated DevOps person (or more), so it's no different from bare-metal in terms of effort. I'd argue it's more effort than bare-metal because clouds and their associated services, APIs and tooling (Terraform, etc.) change much more frequently than old-school Linux and hardware.

2 comments

> Is downtime an actual, business-killing problem in practice? In my experience, very rarely so

Of course it is. If you make a service that other people use as part of their workflows or business, they will be switching providers if you’re the only one that routinely goes down.

This is also a slippery slope. If your engineering team is in the habit of shrugging off downtime as no big deal, it tends to get worse and worse as time goes on, staff turns over, systems scale up, and load increases. If you can’t manage to keep downtime to a minimum when you’re small, it’s going to be much worse when you’re bigger.

> Is downtime an actual, business-killing problem in practice?

Yes, if the business is an enterprise, unscheduled downtime can result in significant losses. It's why disaster recovery is a serious business.

Of course, there are businesses where uptime is absolutely critical, though I'd argue a lot of those already operate their own hardware for that reason (and would benefit little from moving to the cloud) or already have a cloud-based, distributed (multi-AZ or even multi-cloud) system in place.

But is this actually the case for most companies? AWS outages always have major ripple effects across the internet, suggesting that a lot of companies don't actually do what is needed to guarantee uptime, yet manage to survive and succeed despite that.

I worked at a company that had Target and other Fortune 500 retailers as clients, and we had very strict SLAs with financial penalties if we broke them. There was absolutely the possibility that we could have ended up owing our clients more than they paid us.