by epdlxjmonad 2244 days ago
For debugging a distributed system, it may be just okay to use the traditional way consisting of testing, log analysis, and visualizing that everyone is familiar with. Yes, there are advanced techniques such as formal verification and model checking, but depending on the complexity of the target distributed system, it may be practically out of the question or just not worth your time to try to apply these techniques (unless you are in a research lab or supported by FAANG). In other words, it may be that there is nothing inherently inefficient with sticking to the traditional way because distributed systems are hard to debug by definition and there is (and will be) no panacea.

We have gone through the pain of testing and debugging a distributed system that is under development for the past 5 years. We investigated several fancy ways of debugging distributed systems such as model checking and formal verification. In the end, we decided to use (and are more or less happy with) the traditional way. The decision was made mostly because 1) the implementation is too complex (a lot of Java and Scala code) to allow formal techniques; 2) the traditional way can still be very effective when combined with careful design and thorough testing.

Before building the distributed system, we worked on formal verification using program logic and were knowledgeable about a particular sub-field. From our experience, I guess it would require a PhD dissertation to successfully apply formal techniques to debugging our distributed system. The summary of our experience is at https://www.datamonad.com/post/2020-02-19-testing-mr3/.

4 comments

I wonder if the promise of cheap hardware using distributed systems is offset by the increased complexity and developer time. Stack Overflow scaled up rather than out and I have never seen a problem with their site.
Is S.O. a good example of a complicated/large distributed system? I couldn't find any quick googleable results on how many people work on their site.

The reason I ask is I'm working on a product with 30-ish different pods/teams (maybe about 200 - 250 engineers) working on their respective modules, microservices, etc. From what I gather talking to a lot of others at conferences, our distributed system is fairly small (in terms of functional modules/teams, transactions, etc.).

Anyway, even at our small-ish scale, I couldn't imagine running our platform as a single app that we were able to scale up with better hardware.

Also, I think how a company supports multi-tenancy would play a big role in deciding how this works, too, because you can have a scenario with a monolith and DB but you have it partitioned by individual tenant DBs, app servers, etc., and you still have a huge pile of hardware (real or virtual) you're dancing around in....

My point is that Stack Overflow seems to have kept things as simple as possible, the opposite of a "complicated distributed system". It seems to be a classic relational-database-backed app with some additions for specific parts where it needed to scale. In the end I guess it is distributed but it looks like it's based largely around a monolith.

https://stackexchange.com/performance

> Anyway, even at our small-ish scale, I couldn't imagine running our platform as a single app that we were able to scale up with better hardware.

Lots of 250+ engineer teams out there working on monoliths.

Distributed systems (and specifically microservices) are oftentimes solutions to organizational problems, not technical ones.
Their developers can't write good stuff on their resume though. How will those poor chaps get another job without writing Kubernetes, NoSQL, distributed database, large scale horizontally scalable systems? /s

AFAIK, writing "Used a large machine to solve customer problems quickly and efficiently" is not really taken well by a lot of people. The majority of companies can better scale up than out, but out is the new normal for various reasons.

They're using NoSQL and other things. It's mostly Microsoft C# stack though.
What NoSQL? According to their blog they use SQL server.
That’s the problem. Doing the reasonable thing is a career killer.
The notion that Stack Overflow is small and scaling up is long obsolete. It's running on more than a hundred servers now.
Not according to this:

https://stackexchange.com/performance

Where do you get your numbers from?

Yep, and if you look at average CPU load percentage, it's usually in single digits.
When I worked on supercomputer simulations, I always landed on just dumping intermediate results and looking at them over figuring out how to compile something TotalView could look at.

When the latter was possible it saved time, but it was always breaking under me and I eventually said fuck it.

> it may be just okay to use the traditional way consisting of testing, log analysis, and visualizing that everyone is familiar with.

Not at all. I spent quite a lot of effort on debugging distributed deadlocks in a highly available system that was the best-selling product in its category at one of the most famous (and loved) software companies based in SV, and the number of things that can (and will) go wrong is infinite, given that every piece of infrastructure has its own difficult-to-find/reproduce bugs. Things like sockets that stop responding after a few weeks due to an OS bug, messing up your distributed log; or unexpected wake-up sequences during a node crash and fail-over because some part of the network had issues, leading to a split brain for a few moments that needs to be resolved in real time; or a brief out-of-memory issue that leads to a single packet loss, messing up your protocol, etc. We used sophisticated distributed tests and that was absolutely inadequate. You are still in a naive and optimistic mode, though perhaps you as a designer will be shielded from the real-life issues your poor developers would experience, as the original architects usually move on to newer, cooler things and leave the fallout to somebody else.

Check out FoundationDB's approach:

https://www.youtube.com/watch?v=4fFDFbi3toc

Using a single thread to simulate everything is cool (as stated in my previous comment on FoundationDB at https://news.ycombinator.com/item?id=22382066). Especially if the overhead of building the test framework is small.
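For readers who have not seen the idea, here is a minimal sketch of single-threaded simulation in Python. It is a toy under my own assumptions, not FoundationDB's actual framework: the `simulate` function, the delay range, and the 20% drop rate are all made up for illustration. The point is only that when one thread drives everything and all randomness comes from one seeded generator, a failing run can be replayed exactly from its seed.

```python
import heapq
import random


def simulate(seed, num_nodes=3, num_messages=5):
    """Run a toy message-passing simulation on one thread and return its trace.

    Hypothetical example: nodes, delays, and the drop rate are invented.
    """
    rng = random.Random(seed)  # the only source of randomness in the run
    events = []                # min-heap of (delivery_time, seq, src, dst)
    for seq in range(num_messages):
        src = rng.randrange(num_nodes)
        dst = rng.randrange(num_nodes)
        delay = rng.randint(1, 100)  # simulated network delay
        heapq.heappush(events, (delay, seq, src, dst))

    trace = []
    while events:
        # Events are processed strictly in simulated-time order.
        time, _, src, dst = heapq.heappop(events)
        outcome = "dropped" if rng.random() < 0.2 else "delivered"
        trace.append((time, src, dst, outcome))
    return trace


if __name__ == "__main__":
    # Same seed, same trace: a failure seen once can be replayed at will.
    for event in simulate(seed=42):
        print(event)
```

A real framework would also have to simulate disks, clocks, and crashes, which is where the overhead of building it comes from.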

In our case, we use a single process to run everything, in conjunction with strong invariants written into the code. When an invariant is violated, we analyze the log to figure out the sequence of events that triggers the violation. As the execution is multi-threaded, the analysis can take a while. If the execution were single-threaded, the analysis would be much easier. In practice, the analysis is usually quick because the log tells us what is going on in each thread.
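To make that concrete, here is a rough Python sketch of the pattern (our real code is Java and Scala, and nothing below is taken from it). `TaskTracker`, its counters, and the invariant are hypothetical stand-ins. Every log line carries the thread name, and the invariant is checked after each state change, so a violation shows up in the log right next to the events that led to it.

```python
import logging
import threading

# Tag every log line with the thread that wrote it.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(threadName)s] %(message)s",
)
log = logging.getLogger("sim")


class InvariantViolation(Exception):
    pass


class TaskTracker:
    """Hypothetical component that counts running and finished tasks."""

    def __init__(self, total):
        self.total = total
        self.running = 0
        self.finished = 0
        self.lock = threading.Lock()

    def check_invariant(self):
        # Caller must hold self.lock.
        if self.running < 0 or self.running + self.finished > self.total:
            msg = "invariant violated: running=%d finished=%d total=%d" % (
                self.running, self.finished, self.total)
            log.error(msg)
            raise InvariantViolation(msg)

    def start(self):
        with self.lock:
            self.running += 1
            log.info("start: running=%d finished=%d", self.running, self.finished)
            self.check_invariant()

    def finish(self):
        with self.lock:
            self.running -= 1
            self.finished += 1
            log.info("finish: running=%d finished=%d", self.running, self.finished)
            self.check_invariant()


def run(num_workers=4):
    """Run all workers as threads inside one process and return the tracker."""
    tracker = TaskTracker(total=num_workers)

    def worker():
        tracker.start()
        tracker.finish()

    threads = [threading.Thread(target=worker, name="worker-%d" % i)
               for i in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return tracker


if __name__ == "__main__":
    run()
```

When the check fails, the error line and the thread-tagged lines before it are what we read to reconstruct the interleaving.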

So, I guess there is a trade-off between the overhead of building the test framework and the extra effort in log analysis.