Hacker News
by dnautics 2233 days ago
> This will invalidate the arguments for Erlang concurrency model.

What about failure domains? As far as I'm concerned, this is the strongest reason for actor-based concurrency. I can design my architecture so that groups of processes that need to die together die together. And it's usually one or two lines of code, if any.

Here's a real-life example. I have a process that maintains an SSH connection to a host machine, and that SSH connection is used to query information about running VMs on that host machine. If the SSH connection dies, it kills the process that is tracking the host machine, which in turn kills the processes tracking the associated VMs, without perturbing any of the other hosts' processes or VMs. This triggers the host process to be restarted by a supervisor, which then creates a new SSH connection to query for information (possibly repopulating VM processes for tracking information). I wrote zero lines of code for all of this (which, importantly, means I made no mistakes), just one or two configuration options. More importantly, the system doesn't get stuck in an undefined state where complex query failures can cause logjams in the running system.

3 comments

You can tie the fates of threads together in Java using thread groups. If you need more flexibility, or want it to be managed for you, the Akka framework offers this. I believe Akka gives you a model very similar to Erlang.
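
A minimal sketch of the thread-group idea, assuming a subclass that overrides `ThreadGroup.uncaughtException` and calls `interrupt()` on the group. The names `LinkedGroupDemo` and `LinkedGroup` are made up for illustration. Interruption is cooperative, so this only approximates "die together": a thread that ignores interrupts keeps running.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;

final class LinkedGroupDemo {
    // Hypothetical helper: interrupts every member of the group as soon as
    // one member dies from an uncaught exception.
    static final class LinkedGroup extends ThreadGroup {
        LinkedGroup(String name) {
            super(name);
        }

        // Invoked on the dying thread when it has no handler of its own.
        @Override
        public void uncaughtException(Thread t, Throwable e) {
            System.err.println(t.getName() + " died: " + e.getMessage());
            interrupt(); // interrupts all live threads in this group
        }
    }

    // Returns true if the tracker thread was interrupted after the SSH thread failed.
    static boolean run() {
        LinkedGroup host = new LinkedGroup("host-1");
        AtomicBoolean trackerInterrupted = new AtomicBoolean(false);
        CountDownLatch trackerStarted = new CountDownLatch(1);

        Thread tracker = new Thread(host, () -> {
            trackerStarted.countDown();
            try {
                Thread.sleep(60_000); // stands in for long-running VM tracking
            } catch (InterruptedException ex) {
                trackerInterrupted.set(true); // the group told us to stop
            }
        }, "vm-tracker");

        Thread ssh = new Thread(host, () -> {
            throw new IllegalStateException("SSH connection lost");
        }, "ssh");

        try {
            tracker.start();
            trackerStarted.await();
            ssh.start();
            ssh.join();
            tracker.join(5_000);
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
        }
        return trackerInterrupted.get();
    }

    public static void main(String[] args) {
        System.out.println("tracker interrupted: " + run());
    }
}
```

Threads in other groups are untouched, which is the "failure domain" part. Restarting the group afterwards would still have to be written by hand.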

In Java you would create a thread pool and configure it to restart the threads if they die. Each thread would wake up every so often to query SSH and dump its results into a queue. If the query threads die, the processes reading the queue at the other end have nothing to do so they won't execute. It's easy to make a consumer queue that executes some code on another thread whenever data arrives.
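
Roughly what that could look like, as a sketch using only `java.util.concurrent`. `PollingPipeline`, `collect` and the fake query are hypothetical names, and a real SSH call would replace the `Supplier`. One detail worth knowing: with `scheduleWithFixedDelay`, an exception that escapes the task suppresses all later runs, so the sketch catches failures inside the task instead of relying on the pool to restart anything.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

final class PollingPipeline {
    // Polls `query` on a pool thread every `periodMillis` and returns the
    // first `wanted` results as seen by the consuming (calling) thread.
    static List<String> collect(Supplier<String> query, int wanted, long periodMillis) {
        BlockingQueue<String> results = new LinkedBlockingQueue<>();
        ScheduledExecutorService pollers = Executors.newScheduledThreadPool(1);

        pollers.scheduleWithFixedDelay(() -> {
            try {
                results.add(query.get());
            } catch (RuntimeException e) {
                // Swallow and log: letting this escape would cancel future polls.
                System.err.println("query failed: " + e.getMessage());
            }
        }, 0, periodMillis, TimeUnit.MILLISECONDS);

        List<String> seen = new ArrayList<>();
        try {
            while (seen.size() < wanted) {
                String r = results.poll(5, TimeUnit.SECONDS);
                if (r == null) {
                    break; // producers went quiet, nothing to consume
                }
                seen.add(r);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            pollers.shutdownNow();
        }
        return seen;
    }

    // Fake query that fails on its second call, to show polling continues.
    static List<String> demo() {
        AtomicInteger calls = new AtomicInteger();
        return collect(() -> {
            int n = calls.incrementAndGet();
            if (n == 2) {
                throw new IllegalStateException("ssh down");
            }
            return "poll-" + n;
        }, 3, 10);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```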

Java's exposure of the underlying OS threads and cheap transfer of data between threads lets people build libraries on top that offer memory models used by Erlang and others. It's not built in or quite as convenient, but you can use actors and fibers in Java if you want to.

Yeah, that's exactly the problem. It's an afterthought in the system. How certain can you be that the system you're using is composable with any other code brought into your system, even from libraries outside? In Erlang, failure domains are the raison d'etre of the language, so everything in the ecosystem will play nice.

Ultimately, systems like Akka are extremely complicated to get right, even for experts, because you have to think about all of the VM bits underneath. I can teach (and have taught) a junior programmer basic OTP concepts with the confidence that they can't mess things up. Now, they wouldn't be able to come up with the architecture I designed as a good idea, but I could tell them to implement it (with tests!) and expect them to get it right.

That's what exceptions are for, no? If a connection dies an exception is thrown that would propagate up to the top of the thread stack. You'd then catch it and sit in a loop re-establishing the SSH connection, or terminating with a signal to whatever thread started the monitoring thread that it was dying. The act of unwinding the stack would pass through the finally handlers, closing open resources and cleaning up, before the loop starts again.
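
A sketch of that catch-and-reconnect loop, with try-with-resources standing in for the finally handlers. `ReconnectLoop`, `Session` and `queryWithReconnect` are hypothetical names, and the session here is a fake rather than a real SSH client. Note that it only cleans up what this one thread opened; nothing in it tells other threads that the connection went away.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicInteger;

final class ReconnectLoop {
    // Hypothetical stand-in for an SSH session.
    interface Session extends AutoCloseable {
        String query() throws Exception;

        @Override
        void close();
    }

    // Connects and queries, reconnecting after each failure, up to maxAttempts.
    static String queryWithReconnect(Callable<Session> connect, int maxAttempts) {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            // The session is closed while the stack unwinds, whether the
            // query returned normally or threw.
            try (Session s = connect.call()) {
                return s.query();
            } catch (Exception e) {
                last = e;
                System.err.println("attempt " + attempt + " failed: " + e.getMessage());
            }
        }
        throw new IllegalStateException("giving up after " + maxAttempts + " attempts", last);
    }

    // Fake connection whose first two sessions fail on query.
    static String demo() {
        AtomicInteger opened = new AtomicInteger();
        AtomicInteger closed = new AtomicInteger();
        String answer = queryWithReconnect(() -> {
            int n = opened.incrementAndGet();
            return new Session() {
                @Override
                public String query() {
                    if (n < 3) {
                        throw new IllegalStateException("connection reset");
                    }
                    return "3 VMs running";
                }

                @Override
                public void close() {
                    closed.incrementAndGet();
                }
            };
        }, 5);
        return answer + " (opened=" + opened.get() + ", closed=" + closed.get() + ")";
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```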

The failure domain here isn't precisely defined because shared data is allowed (but not required). You could define it as "anything reachable from the thread/fiber stack".

No. If you try to use exceptions to guard your failure domains in this fashion you will not have a good time.

A current discussion on the Loom mailing list is about providing Structured Concurrency [1] primitives.

It would allow you to write something like:

    try (var scope = FiberScope.open(Option.PROPAGATE_CANCEL)) {
        var fiber1 = scope.schedule(() -> sshKeepAlive());
        var fiber2 = scope.schedule(() -> trackHost());
        var fiber3 = scope.schedule(() -> trackVMs());
    }
With the guarantee that if any fiber fails (which you bind to cancelling it), all others will be cancelled.
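
For comparison, a rough approximation of the same guarantee with plain threads and the existing `java.util.concurrent` API. This is not the API under discussion on the mailing list: `LinkedTasks` and `runAll` are hypothetical names, cancellation here is just thread interruption, and unlike a real scope the helper does not wait for cancelled tasks to finish unwinding.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletionService;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

final class LinkedTasks {
    // Runs all tasks concurrently. If any task fails, the remaining ones are
    // cancelled (interrupted) and the failure is rethrown to the caller.
    static void runAll(List<Runnable> tasks) {
        ExecutorService pool = Executors.newCachedThreadPool();
        CompletionService<Void> done = new ExecutorCompletionService<>(pool);
        List<Future<Void>> futures = new ArrayList<>();
        try {
            for (Runnable task : tasks) {
                futures.add(done.submit(task, null));
            }
            for (int i = 0; i < tasks.size(); i++) {
                done.take().get(); // throws ExecutionException if that task failed
            }
        } catch (ExecutionException e) {
            throw new IllegalStateException("a linked task failed", e.getCause());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            for (Future<Void> f : futures) {
                f.cancel(true); // no effect on tasks that already finished
            }
            pool.shutdownNow();
        }
    }

    // Returns true if both trackers were cancelled after the SSH task failed.
    static boolean demo() {
        CountDownLatch started = new CountDownLatch(2);
        CountDownLatch cancelled = new CountDownLatch(2);
        Runnable tracker = () -> {
            started.countDown();
            try {
                Thread.sleep(60_000); // stands in for trackHost() / trackVMs()
            } catch (InterruptedException e) {
                cancelled.countDown();
            }
        };
        Runnable sshKeepAlive = () -> {
            try {
                started.await(); // fail only once both trackers are running
            } catch (InterruptedException e) {
                return;
            }
            throw new IllegalStateException("SSH connection lost");
        };
        try {
            runAll(List.of(sshKeepAlive, tracker, tracker));
            return false; // not expected: the SSH task always fails
        } catch (IllegalStateException expected) {
            try {
                return cancelled.await(5, TimeUnit.SECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    public static void main(String[] args) {
        System.out.println("trackers cancelled: " + demo());
    }
}
```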

[1] http://250bpm.com/blog:71