Hacker News new | ask | show | jobs
by ones_and_zeros 2318 days ago
I think the argument is if you've been around distributed systems long enough you will encounter race conditions. Sure, it's ok to say "Well, the testing infrastructure isn't up to snuff, so we just need to fix it" but at scale this is impractical.

Check out the fallacies of distributed computing[0]. If your testing system can simulate all of those edge cases, it probably looks a lot like TLA+.

[0] https://en.wikipedia.org/wiki/Fallacies_of_distributed_compu...

2 comments

I agree, once you've been around parallel and distributed computing for a while you do notice this stuff, which is why I think the example given in the article isn't a particularlyt good one - that's a really noticeable design omission and I would expect it to be caught pretty early in any dev/QA process as the developer implements the functionality and thinks "Hey, what if .. don't we need to invalidate the other offers?"

I'm sure there are good cases for using TLA+, I'm sure there are situations where it's not only useful for catching errors before they even happen, but in which this more than offsets the upfront costs of the exercise.

I guess I just came away from the article not feeling that such had been demonstrated, in fact I came away with the feeling that the example was contrived to fit the agenda and didn't actually show much.

I think it's a great example. It boils down to simple code that may not be obvious how it behaves within a degraded network. Your dev/QA could even have the thought and actually test it but it work under ideal circumstances giving them false confidence.

I'll go out on a limb and say it is nearly impossible to design and build a test environment that can simulate all network conditions, so that even in trivial cases where a dev might know for a fact that there is an issue, it'll be incredibly hard to reproduce it.

Maybe put another way, formal methods give cover to dev/QA to avoid shipping known but hard to prove buggy code. Bugs they will ultimately be held responsible for.

Keep in mind, there are different approaches to design distributed systems. You can do pretty much everything without sharing or concurrently accessing any resources, without relying on paxos family of algorithms, etc., it would require you to make sure that algorithms converge, can be retries and so on, but that is local reasoning, easily testable, and it doesn't look like TLA+ even remotely nor can it benefit from it. If you need to model 35 high level steps, your system is definitely not designed that way, is too complex and even philosophically is not in the camp of making distributed algorithms so simple, that they can be easily reasoned about. I guess what I'm saying is that if you need TLA+ you are very likely doing it wrong.
> I guess what I'm saying is that if you need TLA+ you are very likely doing it wrong.

Agree to disagree. If your system is distributed it requires you to share data and act concurrently. Otherwise it isn't a distributed system. No amount of retries will cover all of the screwed up things that can happen on an unreliable network.

Sharing data the way you imply is an anti-pattern in distributed systems, you are probably just stuck in a shared memory multithreading mindset. Instead we use specific consistency models. What I described is about models that don't assume that data is up to date on every node but rather operate with the assumption of stale data. And asynchronous networks fail relatively simply too, you are making it the bigger deal than it is. Well, they don't actually fail, but rather this is how they normally operate, they are just unreliable, so you have to assume that data won't be delivered to destinations in time or at all sometimes. If this screws things up for you, then you are definitely doing it wrong. Even if you need low latency, multiple destinations and multiple independent paths is all there is to it, which, again, works well if you don't use consistency models that require consensus. It's like a different world from imperative algorithms with many steps.