by yuliyp
2256 days ago
This often devolves into extremely fragile systems instead. For instance, let's say you failed to load an image on your web site. Would you rather the web site still work with the image broken or just completely fail? What if that image is a tracking pixel? What if you failed to load some experimental module? Being able to still do something useful in the face of something not going according to plan is essential to being reliable enough to trust.
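To make that concrete, here is a minimal Python sketch of a page that treats its core content as required and everything else as optional. The names `render_page`, `broken_tracking_pixel`, and the component keys are hypothetical, made up for illustration, not taken from any real framework.

```python
import logging
from typing import Callable, Mapping

logger = logging.getLogger("page")


def render_page(core: Callable[[], str],
                optional: Mapping[str, Callable[[], str]]) -> str:
    """Render the required content; skip optional parts that fail."""
    # A failure here propagates: without the core content there is no page.
    parts = [core()]
    for name, load in optional.items():
        try:
            parts.append(load())
        except Exception:
            # Log the failure so it stays visible, then carry on without it.
            logger.exception("optional component %r failed", name)
    return "\n".join(parts)


def broken_tracking_pixel() -> str:
    # Hypothetical stand-in for a tracker that cannot be reached.
    raise ConnectionError("tracker unreachable")


page = render_page(
    core=lambda: "<article>hello</article>",
    optional={
        "pixel": broken_tracking_pixel,
        "banner": lambda: "<aside>beta</aside>",
    },
)
print(page)
```

The page still renders with the article and the banner, and the pixel failure ends up in the log instead of taking the whole response down.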
But systems should quickly and reliably surface bugs, which are controllable failures.
A layer of suffering on top of that simple story is that it's not always clear what is and what is not a controllable failure. Is a logic error in a dependency of some infrastructure tooling somewhere in your stack controllable or not? Somebody somewhere could have avoided making that mistake, but it's not clear that you could.
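One rough way to encode that split, sketched in Python: recover only from an explicit list of failures you have decided are environmental, and let everything else propagate as a bug. The `ENVIRONMENTAL` tuple and the `with_fallback` helper are assumptions for illustration, and deciding what belongs in that tuple is exactly the unclear part.

```python
import logging

logger = logging.getLogger("fetch")

# Assumption: these represent failures of the environment, not of our logic.
ENVIRONMENTAL = (ConnectionError, TimeoutError)


def with_fallback(operation, fallback):
    """Run operation; use the fallback on environmental failures only."""
    try:
        return operation()
    except ENVIRONMENTAL as exc:
        logger.warning("degraded: %s", exc)
        return fallback
    # Anything else (KeyError, TypeError, AssertionError, ...) is not caught,
    # so it surfaces immediately as a bug.


def flaky():
    # Hypothetical upstream call that times out.
    raise TimeoutError("upstream slow")


def buggy():
    # Hypothetical logic error: the key does not exist.
    return {"a": 1}["b"]


result = with_fallback(flaky, fallback="cached value")
```

Calling `with_fallback(buggy, fallback=None)` raises `KeyError` instead of quietly returning the fallback.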
An additional layer of suffering is that we have a habit of allowing this complexity to creep or flood into our work and telling ourselves that it's inevitable. The author writes:
> Once your system is spread across multiple nodes, we face the possibility of one node failing but not another, or the network itself dropping, reordering, and delaying messages between nodes. The vast majority of complexity in distributed systems arises from this simple possibility.
But somehow, the conclusion isn't "so we shouldn't spread the system across multiple nodes". Yo Martin, can we get the First Law of Distributed Object Design a bit louder for the people at the back?
https://www.drdobbs.com/errant-architectures/184414966
And let us never forget to ask ourselves this question:
https://www.whoownsmyavailability.com/