by Ankhers 3134 days ago
Not quite. Erlang (via OTP, running on the BEAM VM) uses supervision trees to keep track of the various processes and their crashes. Each supervisor and worker in the tree is its own process. So, if you have 5 supervisors, and each of those starts 2 worker processes, you will have 15 processes in your supervision tree (the VM spawns a number of processes for its own use, so do not assume those 15 are the only processes running in your system).
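To make the arithmetic concrete, here is a toy model in Python (not Erlang, and not how OTP represents a tree). The `Node` class and `count_processes` helper are made up for illustration; the only point is that every supervisor and every worker counts as one process.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One process in the tree: a supervisor if it has children, else a worker."""
    name: str
    children: list = field(default_factory=list)


def count_processes(node):
    """Count the node itself plus everything below it."""
    return 1 + sum(count_processes(child) for child in node.children)


# 5 supervisors, each starting 2 workers (hypothetical names).
supervisors = [
    Node(f"sup_{i}", [Node(f"sup_{i}_worker_{j}") for j in range(2)])
    for i in range(5)
]

total = sum(count_processes(sup) for sup in supervisors)
print(total)  # 15 = 5 supervisors + 10 workers
```

A top-level supervisor sitting above those five would add one more process to the count.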

Basically, at the top level of your application, you will have a supervisor that looks after all of the processes that are important to your application. Each of these processes could provide any kind of functionality (e.g., a database connection, an HTTP server, etc.), or be another supervisor. When you start these applications, they too may start a supervision tree of processes that are important to them (e.g., the database connection may actually start a pool of processes).

In "fail fast" or "let it crash", only the process that actually threw the exception will die. The supervisor looking after that process is notified that it died and, depending on how you have the supervisor configured, may or may not start a new process to replace it.

Another thing to note: depending on how the supervisor is configured, it may itself crash if a process it is monitoring crashes too many times. That crash then bubbles up to its own supervisor, which applies the same rules. Unfortunately, it is possible to take down your entire application this way.
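A rough sketch of that escalation, again as a toy Python model rather than real OTP code. The `Supervisor` class, `SupervisorGaveUp`, and `make_flaky` are hypothetical names, and the restart budget here is a plain counter; as far as I know, a real OTP supervisor counts restarts within a time window, which this sketch leaves out.

```python
class SupervisorGaveUp(Exception):
    """Raised when a supervisor has used up its restart budget."""


class Supervisor:
    def __init__(self, name, max_restarts):
        self.name = name
        self.max_restarts = max_restarts
        self.restarts = 0

    def run_child(self, child):
        """Run child(); on a crash, start a replacement until the budget runs out."""
        while True:
            try:
                return child()
            except Exception as exc:
                self.restarts += 1
                if self.restarts > self.max_restarts:
                    # Too many crashes: the supervisor itself "crashes".
                    raise SupervisorGaveUp(f"{self.name} gave up") from exc
                # Otherwise loop around and start a fresh child.


def make_flaky(failures):
    """Build a worker that crashes `failures` times, then succeeds."""
    state = {"left": failures}

    def worker():
        if state["left"] > 0:
            state["left"] -= 1
            raise RuntimeError("boom")
        return "ok"

    return worker


# A child that crashes twice is simply restarted twice.
sup = Supervisor("sup", max_restarts=3)
result = sup.run_child(make_flaky(2))


# A child that never recovers exhausts the inner supervisor, and the
# failure bubbles up until the top-level supervisor gives up too.
def inner_supervisor():
    inner = Supervisor("inner", max_restarts=2)  # fresh state on each restart
    return inner.run_child(make_flaky(100))


top = Supervisor("top", max_restarts=1)
try:
    top.run_child(inner_supervisor)
    app_alive = True
except SupervisorGaveUp:
    app_alive = False

print(result, sup.restarts, app_alive)
```

The second half is the "bad day" case: nothing above `top` is left to restart it, so the whole application is down.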

TLDR: There is no master process that does all of this. Each supervisor is, however, sort of a master process for its own child supervisors and workers, and the processes a supervisor watches may or may not be restarted upon failure.


So, is there a particular strategy to organizing code in order to 'hot-swap' it (failing code) out, while keeping a production system up and running?
Pretty much! I haven't played with that feature myself, but Erlang's telecom origins help explain this feature. If you're upgrading a telephone switch with N live calls, it'd be optimal to not have to kill those calls just to upgrade some software. There's more nuance to it than that, but "little-to-no downtime", or hot swappable code, is a language feature. Pretty neat idea in an era of "throw away the whole VM/container" to push a config update.
So, there are maybe two parts to your question: how do you structure your code to make it possible to hot load code, and how does that help you recover from crashes?

The BEAM VM allows for an old version and a current version of every module. When you call a function with a fully qualified name (Module:Function), the call always goes to the current version; if you call a function within a module by its function name only, the call stays in the version that is executing, which could be the old one. So, you need to periodically (or on demand, via some message) make a fully qualified call to ensure your process migrates to the new code. You also need to make sure the old version doesn't stay on the stack, so you have to be tail recursive, at least sometimes. Finally, your new code has to cope with state built up by the old code, which can be challenging at times.
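The local versus fully qualified distinction can be modelled in a few lines of Python. This is only an analogy, not how BEAM is implemented: `CodeServer`, `make_version`, and the `counter` module name are all invented for the sketch, and returning the next function from each step stands in for a tail call.

```python
class CodeServer:
    """Toy stand-in for the code loader: keeps a current and an old version."""

    def __init__(self):
        self.current = {}
        self.old = {}

    def load(self, module, functions):
        if module in self.current:
            self.old[module] = self.current[module]  # previous code becomes "old"
        self.current[module] = functions


def make_version(tag):
    """Build one version of a hypothetical 'counter' module."""

    def loop(code, state):
        state["handled_by"].append(tag)
        if state["qualified"]:
            # Like counter:loop(State): resolved in the current version.
            return code.current["counter"]["loop"]
        # Like a plain loop(State): stays in the version that is executing.
        return loop

    return {"loop": loop}


def run(qualified, steps=4, upgrade_at=2):
    code = CodeServer()
    code.load("counter", make_version("v1"))
    state = {"handled_by": [], "qualified": qualified}
    step = code.current["counter"]["loop"]
    for i in range(steps):
        if i == upgrade_at:
            code.load("counter", make_version("v2"))  # hot load the new version
        step = step(code, state)  # each step returns the next function to run
    return state["handled_by"]


print(run(qualified=True))   # ['v1', 'v1', 'v1', 'v2']
print(run(qualified=False))  # ['v1', 'v1', 'v1', 'v1']
```

With the qualified call, the iteration already in flight finishes in v1 and the next one lands in v2. With the local call, the process never leaves v1.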

If your service is generally stable, but occasionally crashes with some types of requests, then you're in a good place. If something is crashing a lot, it can cascade into a supervisor crash, and it is likely that you will have a bad day. In theory, when your service starts (started by you, or if the supervisor restarts it), it has a consistent state, and will be able to service requests; but often it started crashing because some service it requires stopped working right, and restarting the client doesn't really help.

I've found "let it crash" is a good philosophy, but it shouldn't always be implemented literally. In an HTTP server, I'd rather catch crashes, log them, and return an error to the client, not just close the socket. For Erlang server processes running in pg2 that don't maintain much state, it's better to catch and log, because requests are going to be lost if you actually crash.
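The catch-and-log idea looks roughly like this, sketched in Python rather than Erlang. `safe_handle` and `divide` are hypothetical names, and the status codes just mimic HTTP; a real server would plug something like this in around its own handler.

```python
import logging

logger = logging.getLogger("handler")


def safe_handle(handler, request):
    """Run one request; on a crash, log it and answer with an error instead of dying."""
    try:
        return 200, handler(request)
    except Exception:
        # logger.exception records the traceback along with the message.
        logger.exception("request %r crashed", request)
        return 500, "internal error"


def divide(request):
    """Hypothetical handler that crashes on some inputs."""
    a, b = request
    return a / b


ok = safe_handle(divide, (6, 3))
bad = safe_handle(divide, (1, 0))
print(ok, bad)  # (200, 2.0) (500, 'internal error')
```

The bad request still gets a response, and the serving loop stays up for whatever is queued behind it.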