by waf 3723 days ago
I'm really interested in BEAM languages, but the fault-tolerance / supervisor aspect of it doesn't speak to me. Aren't all modern applications fault-tolerant, as long as you don't design something really poorly?

For example, I've never had a single HTTP request bring down an entire website -- that's already isolated. Same with message-queue listening processes. For general batch applications, I've always had them short-lived and running periodically, e.g. every minute, so even a complete crash there is isolated between runs.

One powerful aspect is how it strongly encourages you to design loosely-coupled message-passing systems that should be easier to scale out. But I'm not convinced that's enough to warrant a switch.

5 comments

Erlang's fault-tolerance becomes really useful when you write a server that manages hundreds of thousands of simultaneous connections (a chat server being the typical example). With Erlang, each connection is managed by its own lightweight process (no callbacks, no promises, etc.). If a lightweight process fails, it doesn't bring down the other processes. Moreover, BEAM can signal other processes about the failed process (Erlang's supervision trees are based on this mechanism).
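To make the idea concrete for people who don't read Erlang, here is a rough analogy in Python using asyncio tasks. It is only a sketch: asyncio tasks share one heap and are cooperatively scheduled, so the isolation is much weaker than with BEAM processes. The handler, the task names, and the `notify` callback are all made up for illustration; `notify` loosely plays the role of the signal a monitoring process would receive.

```python
import asyncio


async def handle_connection(conn_id: int) -> str:
    """Hypothetical per-connection handler; connection 2 hits a bug."""
    await asyncio.sleep(0)  # yield control, as real I/O would
    if conn_id == 2:
        raise RuntimeError(f"connection {conn_id} crashed")
    return f"connection {conn_id} ok"


async def serve(conn_ids):
    failures = []  # stands in for the process that gets told about failures

    def notify(task: asyncio.Task) -> None:
        # Called when a task finishes; records it only if it failed.
        if not task.cancelled() and task.exception() is not None:
            failures.append((task.get_name(), str(task.exception())))

    tasks = []
    for cid in conn_ids:
        task = asyncio.create_task(handle_connection(cid), name=f"conn-{cid}")
        task.add_done_callback(notify)
        tasks.append(task)

    # return_exceptions=True keeps one failure from aborting the others.
    results = await asyncio.gather(*tasks, return_exceptions=True)
    ok = [r for r in results if isinstance(r, str)]
    return ok, failures


ok, failures = asyncio.run(serve([1, 2, 3]))
```

Connections 1 and 3 finish normally while the failure of connection 2 is reported separately. In Erlang you get this behaviour (plus real memory isolation) without opting in.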

In a traditional architecture, you would use one thread for each connection (let's ignore the issue of the memory used by each thread), but when one thread fails badly, it can bring down the whole process, and with it all connections instead of just the failing one.

It's more that you don't need to write defensive code, and you're actively encouraged not to. The mantra is akin to "let it fail, it will recover."

Once you start doing it, it becomes more apparent what the difference is. And that's not to say you can't design fault-tolerant systems elsewhere; it's just "easier" to do so with Erlang/BEAM.
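A minimal sketch of the "let it fail, it will recover" style, written in Python rather than Erlang and not meant as how OTP supervisors are actually implemented. The worker contains no error handling at all; a separate `supervise` function (a hypothetical name, as is `make_flaky_worker`) restarts it up to a limit and gives up beyond that. The call counter only exists to simulate a transient fault.

```python
def make_flaky_worker(failures_before_success):
    """Hypothetical worker: crashes a fixed number of times, then succeeds."""
    state = {"calls": 0}

    def worker():
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            # No defensive handling here: the worker just crashes.
            raise RuntimeError(f"crash on call {state['calls']}")
        return "done"

    return worker


def supervise(worker, max_restarts=3):
    """Run worker, restarting it after a crash, up to max_restarts times."""
    restarts = 0
    while True:
        try:
            return worker(), restarts
        except Exception:
            if restarts >= max_restarts:
                raise  # too many crashes: give up and escalate
            restarts += 1


result, restarts = supervise(make_flaky_worker(2))
```

The recovery policy lives in one place instead of being scattered through the worker as try/except blocks. The restart limit is an assumption of this sketch, loosely modelled on the idea that a supervisor stops retrying when a child keeps crashing.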

How fault-tolerant is your web server when the datacenter has a power outage? You can't build a fault-tolerant system with one computing node, by definition.

That means that if you're planning to provide proper availability, you want to work with a system of applications, not just one. That means you should look into creating a networking architecture that can handle all of that.

Most of the time it means that people just use tools that solve that for you, like load balancers. But it doesn't mean that somehow all modern applications are immune to failures.

Fault tolerance allows you to keep processes and state reliably available for longer than the duration of an HTTP request.

You can have continuously running processes without relying on something outside of the language. You can more easily distribute such code as an Elixir package. The code can work without relying on e.g. cron or redis being available and configured.

So a side effect of the fault tolerance is that you can also easily redeploy small parts of your app. If you have a small logic bug, you can avoid bringing down the entire app to fix it.

There's also some more serious stuff, like your workers getting killed by the OS for whatever reason, where you might otherwise need to go in and restart them.

You can do this through most queue-based systems in other languages, but having everything built in is useful.