Hacker News new | ask | show | jobs
by jchb 1917 days ago
> You're just sending messages, so you can send a message over the wire and it would be as simple to the developer as sending it locally

If you are expecting a reply, or some side-effect in another system, as a result of the message you sent (and usually you would expect that, otherwise you wouldn't send the message in the first place) then it's not that simple. If the actor is in the same OS process on the same machine, then message delivery is reliable, and you know you'll either get a reply OR a signal that the other actor died. If, on the other hand, the actor is on another machine across the network then the semantics are different. You cannot always differentiate between the remote actor dying and an intermittent network connection error. So you need to take that into account in your protocol design - for example by making operations idempotent.

I've many times seen Erlang code where the developers didn't make this distinction - because the message passing operation looks the same, remote or not - and as a result the system is not resilient to network failures.

1 comments

That's an important point, and didn't mean to make the problem of distributed actors appear trivial. In my mind, all the more reason to consider the distributed case when baking in actors into your language. For example, what distributed primitives would the language support vs leave modular for libraries?

>I've many times seen Erlang code where the developers didn't make this distinction - because the message passing operation looks the same, remote or not - and as a result the system is not resilient to network failures.

It's not about the message passing operation looking the same. The developers erroneously assumed that message delivery was guaranteed. The core issue here is not unique to actor systems.

Erlang/Elixir and Akka do not guarantee message delivery (even for a local case) from what I understand; guaranteed message delivery may mean different things in different contexts (like message queued in mailbox vs message is received from mailbox). In my opinion, developers should always program defensively when writing networked applications using something like the circuit-break pattern or making operations idempotent as you mentioned. When using actor systems, developers should not assume guaranteed message delivery unless the tool allows them to.

I'm not familiar with other actor systems so I can't speak on their guarantees, but Erlang has outlined the reasoning why messages shouldn't be considered to be guaranteed here [1].

[1] If I send a message, is it guaranteed to reach the receiver?: https://erlang.org/faq/academic.html#idp32844816

That link is talking specifically about the bare send (!) when in a distributed setting. Erlang guarantees locally (in the same VM) and guarantees ordering as well locally (ordering when done in a synchronous block, if you have a single process do send msg1 followed by send ms2, locally it's guaranteed that msg1 will be "received" first, if the receiver is alive). Outside the same VM even if in the same host it can fail (someone closed the socket the other VM is on for e.g).

Erlang also bakes OTP, that is a library for messaging semantics and process behaviours (that processes have to implement to be OTP) and introduces the concept of a "call", where a unique reference is created for the message being sent and only when the receiver processes the message and "replies" (with the answer and the reference) is the "call" considered complete and allows to be sure the message was processed to the point of sending that reply. This is the solution the "ack" mentioned in the linked doc refers to. It's not inherent in send because send is async and the only way to have it know that, is to wait for an ack.

(you can implement the call semantics with plain processes, but it's such a normal thing that in OTP it's baked at a lower level for the process behaviours included in OTP, mostly all the gen_* behaviours)

All of this breaks down in distributed settings because it's physically impossible to guarantee. Your message may be received but the answer back may not because the network glitched or the hardware blew before the response was sent. These are problems of distributed systems though. You can be sure that if you get a reply from a call that the message was received. You still need to take (or not) care of some of the failure modes accordingly to your requirements (be it having idempotency, retry logic, nodes behaving as queue processors, etc). Some failure modes encode the reason as well, for instance a failed call to a non-existent pid in a functioning node that is reachable is different than a failed call to a non-reachable node, but a failure in a node that went down or a node that is alive but not reachable is impossible to discern without additional things.