by lelag 605 days ago
It's not the same product, even if you consider just virtual machines rather than the higher-level services that other commenters are referring to. Sure, public cloud is more expensive, but you pay for the reliability of not being bound to physical hardware. When you buy a dedicated machine from OVH or Hetzner, you get a great deal for the compute power, but if something goes wrong with the hardware, you're stuck waiting for a technician to fix it.

Take the recent Lichess downtime, for example. Their main server had a hardware issue that required physical intervention. This meant the site was down for over 10 hours, and there wasn't much they could do except wait for OVH to send a tech.

If Lichess had been on AWS, the provider would have automatically moved their workload to a functioning server, and the outage would have been much shorter or possibly avoided altogether.

For Lichess, a non-profit, this tradeoff still makes sense. Their service, while important to its users, isn't critical. Nobody dies if Lichess is down, and the cost savings help them keep running. But if your business can't afford downtime, the extra guarantees from a public cloud provider can definitely be worth paying for.


>Take the recent Lichess downtime, for example. Their main server had a hardware issue that required physical intervention. This meant the site was down for over 10 hours, and there wasn't much they could do except wait for OVH to send a tech.

If you're not an HN person with sysadmin skills, yes. But it is NOT that hard to have an in-house RAID HD setup with a failover server, or a failover NAT gateway. AWS and other cloud providers are just a rip-off.

It is hard.

Lichess admins are highly skilled and I'm sure they already have a well-designed infrastructure. You can see what they use at https://docs.google.com/spreadsheets/d/1Si3PMUJGR9KrpE5lngSk...

The issue was with network equipment that they didn't even manage. You can't load balance when your core network is down. There was nothing they could do, as I understand it.

More details at: https://lichess.org/@/Lichess/blog/post-mortem-of-our-longes...

Their architecture is not fault-tolerant. If one server going down takes the whole system down, then it was not designed to be fault-tolerant.

I have been running fault-tolerant systems spread across multiple dedicated servers (the systems include multiple DB/KV stores that are distributed/replicated/sharded, Kafka, etc.). If one server experiences a hardware failure, the system automatically recovers within seconds to minutes (depending on which server or part of the service failed) without any data loss.

It's not that hard. You need the knowledge, but it's not rocket science.
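
To make the "automatically recover" part concrete, here is a minimal sketch of the selection step in a failover setup: pick the most-preferred server that passes a health check. The `Server` type, the `pick_active` function, the server names, and the simulated failure are all hypothetical and only for illustration; a real setup would use actual health probes and would also have to handle replicated state, which this ignores.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Server:
    name: str
    priority: int  # lower value = preferred (hypothetical convention)


def pick_active(
    servers: Sequence[Server],
    is_healthy: Callable[[Server], bool],
) -> Optional[Server]:
    """Return the most-preferred healthy server, or None if all are down."""
    for server in sorted(servers, key=lambda s: s.priority):
        if is_healthy(server):
            return server
    return None


# Hypothetical fleet: one primary and two standbys.
servers = [Server("primary", 0), Server("standby-1", 1), Server("standby-2", 2)]

# Simulate a hardware failure on the primary only.
down = {"primary"}
active = pick_active(servers, lambda s: s.name not in down)
print(active.name if active else "total outage")
```

Note that when every server fails the check at once, for example because shared network equipment is down, `pick_active` returns `None`. Failover between servers doesn't help with that kind of failure.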

Even something as magical as a RAID won't make a technician instantly teleport to your server, power it down in zero seconds, swap out the hard drive and boot it back up in another zero seconds.

OP's comment is valid: physical servers might incur downtime.

But I do agree with your sentiment. "Downtime" is not an argument which should tilt the discussion towards either physical servers or the cloud. AWS data centers famously also have outages, while physical servers often have uptimes of multiple years. So what's better? It's hard to tell, but at the very least, none of these solutions is downtime-free.
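
One rough way to reason about "what's better" is back-of-the-envelope availability math. The sketch below assumes servers fail independently, so at least one of `replicas` servers is up with probability 1 - (1 - a)^n. The 99% per-server figure is a made-up input, not a measurement of any provider, and the independence assumption breaks as soon as the replicas share a failure point such as a network device or a data center.

```python
def combined_availability(per_server: float, replicas: int) -> float:
    """Probability that at least one of `replicas` independent servers is up."""
    return 1.0 - (1.0 - per_server) ** replicas


HOURS_PER_YEAR = 24 * 365

# Hypothetical per-server availability of 99%; not real provider data.
for n in (1, 2, 3):
    a = combined_availability(0.99, n)
    downtime_hours = (1.0 - a) * HOURS_PER_YEAR
    print(f"{n} server(s): availability {a:.6f}, ~{downtime_hours:.2f} h/year down")
```

The same formula applies whether the servers are rented dedicated machines or cloud instances, which is why downtime alone doesn't settle the question either way.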

No, but if you have backups and DR set up, most hyperscalers will just automatically move your workload someplace else upon failure within minutes (state management complexity notwithstanding—you need to architect for that).