| Fun stuff. It's amusing that the definition of a distributed system used, "Where a computer that you never heard of can bring your system down," is actually one of Leslie Lamport's more famous quotes. When I joined Sun in '86 I thought it was the pinnacle of technological excellence to be a kernel programmer, and I joined the Systems Group, the notional center of the Sun universe, in 1987. However, I discovered that the primary reason you had to be picky about kernel programmers was that their bogus pointer references crashed the machine (as they occurred in kernel mode with full privileges), and then discovered that network programmers could crash the whole world with their bugs. So clearly they must be in a pantheon above kernel programmers. :-) The author has come to discover that in the network world things can die anywhere, and this makes reasoning about such systems very complicated. Having been a part of the RPC and CORBA evolution, I keenly felt the challenges of making APIs that "looked" like function calls to a programmer but took place across a network fabric, and thus introduced error conditions that couldn't exist in locally called routines (a simple example being the inability to return from the function due to a network partition). Lamport's work in this space is brilliant and inspired. Network systems can be analysed and reasoned about as physical systems in the cases where they exhibit discontinuities when treated as simple algorithms. The value here is to realize that a large number of physical systems tolerate a tremendous amount of randomness and continue to work as intended (windmills, for example), while many algorithms only work consistently given a set of key invariants. I gave a talk inspired by Dr. Lamport's work, titled 'Java as Newtonian Physics', which was a call to action to create a set of invariants, in the spirit of physical laws, that would govern the behavior and capabilities of distributed systems. 
It was way ahead of its time (AOL dialup connections were still a thing) but much of the same inspiration (presumably from Lamport) made it into the Google Spanner project. As with many things, at a surface level many people learn an API which does something under the covers across the network, but having come up through their education thinking of everything as an API, they don't fundamentally grasp the notion of distributed computation. Then at some point in their experience there will be that 'ah ha' moment when suddenly everything they know is wrong, which really means they suddenly see a bigger picture of things. That makes distributed systems questions in interviews an excellent litmus test for understanding where people are in their journey. |
So then the trick becomes to make sure that a message contains a payload that is 'worth it'.
Making the assumption that any message may not make it to its destination and that confirmations may be lost (akin to your return example) is still challenging but I find it easier to reason about than in the RPC analogy.
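To make the "any message or confirmation may be lost" assumption concrete, here is a minimal sketch in Python. Everything in it is hypothetical and invented for illustration (`LossyChannel`, `Receiver`, `send_reliably`, the loss probability); it simulates the network with a seeded random number generator rather than using real sockets. The sender retries until it sees an acknowledgement, and the receiver deduplicates by message id so a retry after a lost ack does not apply the payload twice.

```python
import random


class LossyChannel:
    """Hypothetical channel that drops each transmission with probability `loss`."""

    def __init__(self, loss, seed=0):
        self.loss = loss
        self.rng = random.Random(seed)  # seeded so runs are repeatable

    def deliver(self):
        # True means this one transmission made it across.
        return self.rng.random() >= self.loss


class Receiver:
    """Applies each message id at most once, but acknowledges every copy."""

    def __init__(self):
        self.seen = set()
        self.applied = []

    def handle(self, msg_id, payload):
        if msg_id not in self.seen:
            self.seen.add(msg_id)
            self.applied.append(payload)
        # Always ack: an earlier ack for this id may have been lost.
        return msg_id


def send_reliably(channel, receiver, msg_id, payload, max_attempts=50):
    """Retry until an ack arrives; return the number of attempts used."""
    for attempt in range(1, max_attempts + 1):
        if not channel.deliver():
            continue  # the message itself was lost
        ack = receiver.handle(msg_id, payload)
        if not channel.deliver():
            continue  # the ack was lost; the sender cannot tell the two cases apart
        if ack == msg_id:
            return attempt
    raise TimeoutError(f"no ack for message {msg_id} after {max_attempts} attempts")
```

The point of the sketch is that the sender never learns whether the message or the confirmation went missing, so the only safe response is to resend, and that only works if the receiver treats duplicates as harmless.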
I love that Lamport quote :)
A nasty side effect of all this network business is that what looks like a function call can activate an immense cascade of work behind the scenes; gethostbyname (ok, getaddrinfo) is a nice example of such a function. On the surface it's a pretty easily understood affair, but by the time you're done and you get your results back you've likely triggered millions of cycles on 'machines that you've never heard of'.
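For illustration, this is roughly what that innocent-looking call looks like from Python, using the standard library's `socket.getaddrinfo`. The wrapper name `resolve` and its default port are my own choices, not anything from the thread.

```python
import socket


def resolve(host, port=80):
    """Return the unique IP addresses getaddrinfo reports for `host`."""
    # One local-looking call. Depending on how the system resolver is
    # configured, it may read a hosts file, hit a cache, or query DNS
    # servers elsewhere on the network.
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address string for both IPv4 and IPv6.
    return sorted({info[4][0] for info in infos})
```

Calling `resolve("example.com")` reads like any other function call, yet it can block for seconds or raise `socket.gaierror` for reasons that have nothing to do with the calling machine. A numeric address such as `resolve("127.0.0.1")` is answered without any lookup at all.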