by parliament32 2200 days ago
I love how there's this myth that servers and services just blow up every 10 minutes 24/7 and unless you have a legion of ops personnel you're going to get hours of downtime each year.

Servers, for the most part, just work. In climate-controlled DC environments, hardware failures are exceedingly rare. Apart from hard drives, most hardware will happily tick along for a decade, if not longer.

Sane production-grade OSes (read: not Ubuntu) will also happily run for literal years with zero human intervention. For obvious reasons, it's a bad idea to not patch your systems, but things will continue to "just work" pretty much forever unless you're running really shitty code.

For renting vs buying servers, there are upsides and downsides. Buying gear is far, far cheaper if you plan to be around for more than a year, but renting dedicated servers gives you a lot more flexibility -- to provision a new server, you hit a button in their online panel, wait 15 minutes, then let your deployment strategy take care of the rest.

I find it almost mind-boggling that AWS and friends have convinced people that it's normal to spend ridiculous amounts of money for fairly "meh" service specs on what are essentially VMs.


The points you make are fine, but I think the pain grows linearly with the number of servers you manage: with N servers, you're N times more likely to see something take one down. It just happens more frequently. At some point that becomes often enough that you don't want to deal with it anymore.

I don't think you understand the sheer scale you need before you're experiencing a failure more often than once a month. In my anecdotal experience you'd need at least 1k servers for that to happen... and if your company is big enough for $2MM capex on servers alone, you can handle $100 for remote hands and 30 minutes of engineer time.

Not to mention that at that scale you have plenty of redundancy and, if your ops team knows what they're doing, automagic failover / HA. Anything that happens can easily "wait till Monday", no need for 24/7 anything.

If it's often enough to be noticeable, your scale is large enough to pay someone to be ops full time.
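
To put rough numbers on the scale argument above, here is a back-of-the-envelope sketch. It takes the anecdotal "about one failure a month at 1k servers" figure at face value and assumes failures are independent and spread evenly over the year, which is a simplification. The implied per-server rate is only what that anecdote works out to, not measured data, and the function names are made up for illustration.

```python
# Back-of-the-envelope fleet failure math (illustrative assumptions only).

def implied_annual_failure_rate(servers: int, failures_per_month: float) -> float:
    """Per-server annual failure rate implied by an observed fleet-wide rate."""
    return failures_per_month * 12 / servers


def expected_failures_per_month(servers: int, annual_failure_rate: float) -> float:
    """Expected fleet-wide failures per month, scaling linearly with fleet size."""
    return servers * annual_failure_rate / 12


# Anecdote from the thread: ~1 failure/month at ~1k servers.
rate = implied_annual_failure_rate(1000, 1.0)

# Capex per server implied by "$2MM for 1k servers".
capex_per_server = 2_000_000 / 1000

if __name__ == "__main__":
    print(f"implied per-server annual failure rate: {rate:.1%}")
    print(f"implied capex per server: ${capex_per_server:,.0f}")
    for n in (10, 100, 1000):
        per_month = expected_failures_per_month(n, rate)
        print(f"{n:>5} servers: {per_month:.2f} failures/month, "
              f"about one every {1 / per_month:.0f} months")
```

Under those assumptions the linear scaling cuts both ways: failures do get N times more frequent, but a fleet a tenth the size sees them a tenth as often.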