Hacker News new | ask | show | jobs
by derefr 2792 days ago
> I imagine there might be some really interesting (for meanings of interesting that include shoot me now) and hard to track down bugs as you deal with inconsistent clocks not just across systems within a network, but processes within a single system.

I feel like the "safe assumption" that the other end of a given IPC channel (or even inter-thread communication channel) is on the same machine, is responsible for the vast majority of failures we see in e.g. Jepsen testing of databases.

After all, in sufficiently-large computers (i.e. HPC clusters that pretend to be one "computer"), you've got NUMA zones that are light-microseconds away from one another, where even threads of the same process can literally end up needing vector clocks to linearize events between themselves.

It probably wouldn't be too bad a thing if things like the Linux base-system used only internal IPC mechanisms that exposed this unreliability (like e.g. Erlang does with "unreliable async message passing" as its IPC primitive), forcing each component to deal with the fact that its peers may or may not be netsplit away from it.

Even if that scenario will only come up if you're writing code to get your GPS position from a Dyson sphere of 10-mile-deep Matryoska brains.

1 comments

I bet that assumption is responsible for a large number of problems. I just also think it's correct enough most the time and relied on enough that if it all of a sudden often wasn't true, we'd see our carefully crafted applications for what they really are, a pile of assumptions that sometimes have little relation to reality.