by kgeist 1184 days ago
There are several reliability issues:

  1) a single panic/exception/segfault in the executable brings down the whole website and so it will be unavailable until the executable restarts

  2) entropy *always* increases (RAM usage, memory corruption, hardware issues, OS misconfiguration etc.) so eventually the application will break and stop serving traffic until it's repaired/restarted (which can take time if it's a hardware issue)

  3) deployments are tricky if there's nothing before the executable (stop, update, restart => downtime)

  4) if the cache is in-process, it has to be repopulated from scratch on a restart, leading to temporary slowdowns (and possibly a thundering herd problem), which will happen *every time* you deploy an update

I think much of it is ignorable if the site is just a personal blog or a static site. But if the site is a real time "web application" which people rely on for work, then you still need:

  1) some kind of containerization, to deal with inevitable entropy (when a container is restarted, everything is back to the initial clean state)

  2) at least two instances of the application: one instance crashes => the second one picks up traffic; or during rolling updates: while one instance is being killed and replaced with a new version, traffic is routed to another instance

  3) persistent data (and sometimes caches) need to be replicated (and backed up) -- we've had many hardware issues corrupting DBs

  4) automatic failover to a different machine in case the machine is dead beyond repair

>not some external monster tool like k8s

What can you use instead of k8s for this kind of scenario? (an ultra reliable setup which doesn't need a whole cluster)

5 comments

It seems to me that people tend to vastly overestimate their uptime requirements. "Real time 'web application'" used by hundreds of millions of people can be down for hours and yet succeed wildly, just look at Twitter, both its old failwhale and new post-Musk fragile state. Complexity, on the other hand, along with the lower iteration speed and higher fixed costs it brings, can kill a business much more easily than a few seconds of downtime here and there.

You don't need an "ultra reliable setup" or even a "cluster". You can have one nginx as a load balancer pointing at your unicorn/gunicorn/go thing; it's very unlikely to ever go down. You can run a cronjob with pg_dump and rsync. On the off chance that your server dies irrecoverably, corrupting the DB (which is really unlikely for Postgres), chances are your business will survive a fifteen-minute-old database.
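A minimal sketch of the pg_dump + rsync cronjob mentioned above, assuming Python is available on the box. The database name, backup directory, and remote target are hypothetical placeholders, and the script assumes `pg_dump` and `rsync` are on the PATH with credentials already configured.

```python
import datetime
import subprocess
from pathlib import Path

# Hypothetical values: replace with your own database, directory, and remote host.
DB_NAME = "appdb"
BACKUP_DIR = Path("/var/backups/pg")
REMOTE = "backup@backup-host:/srv/pg-backups/"


def dump_command(db_name: str, out_file: Path) -> list[str]:
    # -Fc writes pg_dump's custom archive format, which pg_restore can read.
    return ["pg_dump", "-Fc", "-f", str(out_file), db_name]


def sync_command(src_dir: Path, remote: str) -> list[str]:
    # The trailing slash makes rsync copy the directory's contents,
    # not the directory itself.
    return ["rsync", "-a", f"{src_dir}/", remote]


def run_backup(now: datetime.datetime) -> Path:
    """Dump the database to a timestamped file, then mirror the directory."""
    BACKUP_DIR.mkdir(parents=True, exist_ok=True)
    out_file = BACKUP_DIR / f"{DB_NAME}-{now:%Y%m%dT%H%M%S}.dump"
    # check=True raises if either tool exits non-zero, so cron reports the failure.
    subprocess.run(dump_command(DB_NAME, out_file), check=True)
    subprocess.run(sync_command(BACKUP_DIR, REMOTE), check=True)
    return out_file
```

Calling `run_backup(datetime.datetime.now())` from a cron entry that fires every fifteen minutes would give roughly the "fifteen-minute-old database" worst case. Pruning old dumps is left out of the sketch.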

Most "realtime web applications" are not aerospace, even though we like to pretend that's what we work on. It's an interesting confluence of engineering hubris and managerial FOMO that got us here.

> It seems to me that people tend to vastly overestimate their uptime requirements. "Real time 'web application'" used by hundreds of millions of people can be down for hours and yet succeed wildly

That may be true for social media apps where the Terms of Service don't include any SLAs/SLOs, but if you're a SaaS company of any kind, the agreements with clients often include uptime requirements. Their engineers will often consider some form of "x number of nines" an industry standard.

In the projects I work on, things go down all the time, for various reasons (hardware issues, networking problems, cascading programming errors). It's the various additional measures we have put in place which prevent us from having frequent outages... Before the current system was adopted, poor stability of our platform was one of the main complaints.

I agree that for many projects it may be overkill.

Networking issues and even hardware issues are very unlikely if you can fit everything into one box, and you can get a lot in one box nowadays (TB+ RAM, 128+ core servers are now commodity). MTBF on servers is on the order of years, so hardware failure is genuinely rare until you get too many servers into one distributed system. And even then, two identical boxes (instead of binpacking into a cluster, increasing failure probability) go a very long way.
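To put rough numbers on the "more servers, more failures" point, here is a back-of-the-envelope sketch. It assumes independent servers with a constant failure rate of 1/MTBF (an exponential model), and the 5-year MTBF is an illustrative assumption, not a measured figure.

```python
import math

HOURS_PER_YEAR = 24 * 365


def p_any_failure(n_servers: int, mtbf_hours: float,
                  window_hours: float = HOURS_PER_YEAR) -> float:
    """Probability that at least one of n independent servers fails within
    the window, assuming a constant per-server failure rate of 1/MTBF."""
    return 1.0 - math.exp(-n_servers * window_hours / mtbf_hours)


# Assumed MTBF of 5 years per server (illustrative only).
mtbf = 5 * HOURS_PER_YEAR
for n in (1, 2, 20, 100):
    print(f"{n:>3} servers: {p_any_failure(n, mtbf):.0%} "
          "chance of at least one failure per year")
```

Under these assumptions a single box sees a hardware failure in a given year about 18% of the time, while a 20-node system sees at least one failure about 98% of the time. Note that this counts any failure, not an outage: with two identical boxes, an outage requires both to be down at once.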

It's a vicious circle. We build distributed multi-node systems, overlay software-configured networks, self-healing clusters, separate distributed control planes, split everything into microservices, but it all makes systems more fragile unless enough effort is spent on supporting all that infrastructure. Google might not have the option of scaling vertically, but the overwhelming majority of companies do. Hell, even StackOverflow still scales vertically after all these years! I know startups with no customers who use more servers than StackOverflow does.

Re: Crashes.

If there's a bug that brings the server down, it will happen in all instances and repeatedly no matter how many times you restart. Especially when the users keep repeating the action that triggered the crash.

Re: Entropy. Entropy increases with a complex setup. The whole point of not having a complex setup is to reduce entropy and make the system as a whole more predictable.

Re: caches. There are two types of caches: indexes that are persisted with the database, and LRU caches in memory. LRU caches are always built on demand, so this is not even a problem.

Plus modern CPUs are incredibly fast and can process several GBs of data per second. Even in the worst cases, you should be able to rebuild all your caches in a second.

>If there's a bug that brings the server down, it will happen in all instances and repeatedly no matter how many times you restart.

Not necessarily. Many bugs are rare and triggered only under specific conditions (a user, or the system, must do X, Y, Z at the right moment), so they don't happen all the time. But when one does, the whole server crashes or starts behaving in a funky way, and other users are affected. Sure, you may say that if it's a rare bug, users will rarely be affected. But we never have just one bug like that; there are always N such bugs lurking around (we never know how many in a large application). Multiply by N and you have server crashes for different reasons quite often, making your paying customers dissatisfied. Your argument also assumes you can fix such a bug immediately, which isn't always true: there are often Heisenbugs that take weeks to root out and fix, and meanwhile your customers are affected. Sure, the application will restart, but ALL users (not just the one who triggered the bug) can lose work or get random errors while the app is unavailable -- not a good experience. Having several app instances as backup helps soften such blows, because there will always be at least one app instance available.

>Entropy increases with complex setup. The whole point of not having a complex setup is to reduce entropy and make the system as whole more predictable

I agree that entropy increases with a complex setup, but there's also base entropy which accumulates simply with time (which I think is more dangerous). Make a sufficient number of changes to the setup of your application (which you often need to do if you release often) and eventually someone or something somewhere will make a mistake or expose a bug, and you will need to repair it. You won't be able to do that easily if your setup is not containerized; containers let you return to the clean state with little effort. We've had issues like that with our non-containerized deployments, and it's a very complex and error-prone undertaking to do it flawlessly (no downtime or regressions) compared to containerized deployments.

>Plus modern CPUs are incredibly fast and can process several GBs of data per second. Even in the worst cases, you should be able to rebuild all your caches in a second

Hm, usually caches are placed in front of disk-based DBs to speed up I/O, i.e. it's not a matter of slow CPUs, it's a matter of slow I/O. Rebuilding everything which is in the caches from DB sources is not super fast.
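A rough estimate shows why I/O rather than CPU dominates here. The sketch below assumes each cache entry costs one database query to rebuild; the entry count, per-query latency, and concurrency are all made-up illustrative inputs.

```python
def warmup_seconds(entries: int, ms_per_query: float, concurrency: int) -> float:
    """Rough time to refill a cache when each entry costs one DB query
    and `concurrency` queries run in parallel."""
    return entries * (ms_per_query / 1000.0) / concurrency


# All three inputs are assumptions chosen for illustration.
print(warmup_seconds(entries=1_000_000, ms_per_query=5.0, concurrency=16))
```

With these inputs the refill takes on the order of five minutes, not one second, and during that window every miss lands on the database.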

> and you will need to repair it and you won't be able do it easily because your setup is not containerized which would allow to return to the clean state quite easily with no effort.

Automated deployment including server bringup is orthogonal to using containers or hot failover. For example, at $WORK we're deploying Unreal applications to bare metal Windows machines without using containers, because Windows containers aren't as frictionless as Linux ones and the required GPU access complicates things further.

Note that you can totally have more than one instance of the same app/binary running on the same machine. You don't even need containers for that.

But then you need some kind of load balancer, which hsn915 said was "too complicated".

Upfront customer requirements often say they want >99.5% uptime (which allows for about 3.6h of downtime in a 30-day month anyway) or some such. In practice B2B customers often don't care much if hour-long downtimes happen every week during off-hours. Sometimes they're even ok when it gets taken down over a whole weekend. Things serving the general public have different requirements, but even they have their activity dips late at night, when the business impact of maintenance is much lower.
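The downtime budget behind an uptime percentage is simple arithmetic. A small sketch, assuming a 30-day month (the function name is just for illustration):

```python
def downtime_budget_hours(uptime_percent: float, days: int = 30) -> float:
    """Hours of downtime permitted over `days` days at the given uptime target."""
    return (1 - uptime_percent / 100.0) * days * 24


for target in (99.0, 99.5, 99.9, 99.99):
    print(f"{target}% uptime over 30 days allows "
          f"{downtime_budget_hours(target):.2f} h of downtime")
```

At 99.5% the budget is about 3.6 hours a month; each extra nine divides the budget by ten, which is where the cost of redundancy starts to climb.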

> 2) entropy *always* increases (RAM usage, memory corruption, hardware issues, OS misconfiguration etc.) so eventually the application will break and stop serving traffic until it's repaired/restarted (which can take time if it's a hardware issue)

This is not what entropy means. Even if you constrain it to hardware, there is no reason to think that this will happen eventually, unless your timeline is very long.

Also, there are typically multiple processes. A panic stops only one process.