by Axsuul 443 days ago
Cloud is expensive but hardware failures are at least handled gracefully. With coloc you'd have some serious downtime. That means you'd need to get to a certain level of redundancy in order to have coloc make sense.

I'd love to move to coloc for my SaaS but it doesn't feel as resilient. Please correct me if I'm wrong as I'd love to move off the cloud.

3 comments

Enterprise server gear is pretty reliable, and you build your infra to be fully redundant. In our setup, no single machine failure will take us offline. I have 13 machines in a rack running a >$10MM ARR business, and haven't had any significant hardware failures. We have had occasional drive failures, but everything is RAID1 at a minimum, so they are a non-issue.

We just replaced our top of rack firewall/proxies that were 11 years old and working just fine. We did it for power and reliability concerns, not because there was a problem. App servers get upgraded more often, but that's because of density and performance improvements.

What does cause a service blip fairly regularly is having a single upstream ISP. I will have a second ISP into our rack shortly, which means that whole class of short outage will go away. It's really the only weak spot we've observed. That being said, we are in a nice datacenter that is a critical hub in the Pacific Northwest. I'm sure a budget datacenter will have a different class of reliability problems that I am not familiar with.

But again, an occasional 15m outage is really not a big deal business-wise. Unless you are running a banking service or something, no one cares when something happens for 15m. Heck, all my banks regularly have "maintenance" outages that are unpredictable. I promise, no one really cares about five nines of reliability in the strong majority of services.
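
For a sense of scale, here is a minimal sketch of the arithmetic behind "nines". It assumes a 365-day year, and the function names are made up for illustration. Five nines leaves roughly 5.3 minutes of downtime per year, so a single 15m outage in a year lands between four and five nines.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability: float) -> float:
    """Yearly downtime budget, in minutes, for an availability fraction like 0.99999."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def availability_from_downtime(minutes_down: float) -> float:
    """Availability fraction implied by a total number of down minutes in one year."""
    return 1.0 - minutes_down / MINUTES_PER_YEAR


if __name__ == "__main__":
    for label, target in [("three nines", 0.999), ("four nines", 0.9999), ("five nines", 0.99999)]:
        print(f"{label}: {allowed_downtime_minutes(target):.1f} min/year")
    # One 15-minute outage in a whole year:
    print(f"one 15m outage: {availability_from_downtime(15):.5%}")
```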

Sounds great. Yep, what I mean is you will need to make your systems fully redundant before considering coloc if your business depends on reliability and uptime. That usually requires the business to reach a certain scale first.
Sure, but making something redundant is not really that difficult. HAProxy in front of N nodes across M racks, ideally in separate DCs, and then a floating IP in front of your HAProxies. Set up a hot standby for your DB.
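
To make the idea concrete, here is a toy sketch of what a health-checked load balancer does conceptually: rotate across nodes and skip any that are marked down. This is not HAProxy's actual implementation; `Backend`, `RoundRobinPool`, and the node names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    name: str
    healthy: bool = True


class RoundRobinPool:
    """Round-robin selection that skips backends marked unhealthy."""

    def __init__(self, backends):
        self.backends = list(backends)
        self._next = 0  # index of the backend to try first on the next pick

    def mark(self, name: str, healthy: bool) -> None:
        """Record the result of a health check for one backend."""
        for backend in self.backends:
            if backend.name == name:
                backend.healthy = healthy

    def pick(self) -> Backend:
        """Return the next healthy backend, or raise if every backend is down."""
        total = len(self.backends)
        for offset in range(total):
            index = (self._next + offset) % total
            if self.backends[index].healthy:
                self._next = (index + 1) % total
                return self.backends[index]
        raise RuntimeError("no healthy backends")


if __name__ == "__main__":
    pool = RoundRobinPool([Backend("app1"), Backend("app2"), Backend("app3")])
    pool.mark("app2", healthy=False)  # simulate one machine failing
    print([pool.pick().name for _ in range(4)])
```

With one node marked down, traffic keeps flowing to the remaining nodes, which is the "no single machine failure takes us offline" property.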

I used to joke that my homelab almost had better reliability than any company I’d been at, save for my ISP’s spotty availability. Now that I have a failover WAN, it literally is more reliable. In the five years of running a rack, I’ve had precisely one catastrophic hardware failure (mobo died on a Supermicro). Even then, I had a standby node, so it was more of an annoyance (the standby ran hotter and louder) than anything.

Hot spares and remote hands will get you a lot.

And when you get down to it, AWS isn't actually that reliable. I thought EBS volumes had magic redundancy foo but it turns out they can fail, and they fail in a less obvious way than a regular disk. AWS networking is constantly bouncing and the virtual network adapters just sometimes stop working. They're also running old CPUs.

Depending on your workload you may be able to pay off your new hardware with just a couple months' savings.
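
A back-of-the-envelope way to check this for your own workload, as a sketch. The function name is made up and the dollar figures in the example are placeholders, not real quotes; plug in your own bills.

```python
import math


def months_to_break_even(hardware_cost, cloud_monthly, colo_monthly):
    """Whole months until monthly savings cover the upfront hardware cost.

    Returns None when colo is not cheaper per month, since it never pays off.
    """
    monthly_savings = cloud_monthly - colo_monthly
    if monthly_savings <= 0:
        return None
    return math.ceil(hardware_cost / monthly_savings)


if __name__ == "__main__":
    # Placeholder numbers: $30k of hardware, $20k/month cloud bill,
    # $5k/month for rack space, power, bandwidth, and remote hands.
    print(months_to_break_even(30_000, 20_000, 5_000))  # 2
```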

This. With AFRs as they are today, plus warranty options and remote hands, it’s hardly as bad as most people seem to think, especially if their recollection of working with colocation is from 20 years ago.
Got any recommended providers?
Equinix and Digital Realty are gold standard especially if you need comprehensive remote hands but $$$. CoreSite is also good and cheaper if you're in the US.
Not sure cloud is necessarily more resilient--imo it's less resilient. On the other hand, it's fully automated with robust APIs, so there are easy tools to mitigate failures, like node/machine sets (scale sets, scaling groups, auto scaling groups, whatever the provider calls them).

You could use an orchestration solution to help handle automatic failover. There's a handful of container-based options from heavy duty Kubernetes to Docker Swarm and Nomad.

Containers are nice since you can bypass most of the host management where you only need basic security patching and installation of your container runtime. There's also k8s distros like OpenShift to make k8s setup easier if you go that route.

Yep I use orchestration (Nomad) but still you would need hardware redundancy. For example, the database server is currently a single point of failure. In the cloud, if there's a hardware failure, it will simply go down and come back up with a new instance. In coloc, you'd need to have the data center debug and replace hardware which means extended downtime.
Patroni will manage a PG cluster and auto fail over. I've heard of Stolon as well. If you're on k8s, there are a couple of good operators that will handle this.

I believe paid PG vendors like EnterpriseDB and maybe Crunchy have their own tools.

You would not need to have extended downtime. Every major RDBMS that I’m aware of supports standby nodes, and if you want, a full active-active cluster (not recommended, personally).

The downtime only lasts as long as the health check monitoring interval you have configured.
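
As a rough sketch of that bound, under a simple model that is my assumption: checks start every `interval` seconds, the primary dies right after a passing check, and `fall` consecutive failed checks are required before the standby is promoted. The parameter names and the example numbers are illustrative, not taken from any particular tool.

```python
def worst_case_failover_seconds(interval, fall, timeout, promotion):
    """Rough upper bound on downtime for a hot-standby failover.

    interval:  seconds between the start of consecutive health checks
    fall:      consecutive failed checks required before declaring the node dead
    timeout:   seconds a single check may hang before it counts as failed
    promotion: seconds needed to promote the standby and repoint traffic
    """
    # The last of the `fall` failing checks starts at fall * interval and
    # may run until it times out.
    detection = fall * interval + timeout
    return detection + promotion


if __name__ == "__main__":
    # Illustrative settings: check every 2s, 3 failures to trip,
    # 1s check timeout, 10s to promote the standby.
    print(worst_case_failover_seconds(interval=2, fall=3, timeout=1, promotion=10))  # 17
```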

When using colocation, nothing is stopping people from storing the database data externally from the server running the database like some cloud services do. But doing so, either in cloud or not, does have a serious downside: greatly increased latency.
Kind of defeats the purpose of colocation if you're not also running the database on your own server.