| HN Mirror


	by bitL 3506 days ago
	The thing is that there are problems we simply can't solve in theory or in practice, yet we use approximate solutions all the time, and advanced distributed algorithms are one such case. In theory, we simply can't handle real-world asynchronous systems. And when we pretend we have partially synchronous systems and build abstractions around them, they don't work 100% of the time. Now add in some complex bugs (like a distributed deadlock in a transactional system that involves exactly 7 nodes, no fewer and no more) and you might start to understand why functional test cases might not really be an option for avoiding these issues (you can obviously write them, but they won't really help you). I worked on such a system; we had hundreds of thousands of tests, yet they were clustered around known issues and not around issues that happened when, e.g., a node went down and came back up, data were out of sync, and sockets between nodes were filling up due to the OS's performance limitations. Moreover, many of these issues only start showing up when you push throughput to the max, e.g. during trading spikes, and adding a test that checks invariants would lower the throughput so that those issues simply wouldn't show up anymore.

1 comments

lisivka 3505 days ago

Each test case increases confidence in the system by a small amount. Confidence can never reach 100% (because we would need to predict the future to achieve that), so no amount of testing can give you 100% confidence, only 80%, 95% (2x price), 98% (4x), 99.5% (8x), 99.95% (16x), 99.995% (32x), and so on. That's your message, right?

bitL 3505 days ago

My message is more that your confidence after writing hundreds of thousands of tests might be just 50%. From my own experience, every single bad case that can happen in a complex system will happen at some point at some customer, wrecking their system and potentially costing them millions, and with serious trading bugs even leading to bankruptcy. Your test suite won't catch these up front, only reactively, once you add that test case to your regression suite after bad things have happened. In complex systems, tests are just a heuristic for quality, not really something you can rely on (but it's way, way better to have them than not). Often tests are clustered around low-hanging fruit or around the parts of the system used by developers or initial customers, and any deviation in usage patterns can cause an outbreak of new, unexpected incorrect situations. Similarly, proving correctness with formal verification tools might increase your confidence, but won't give you 100% either, as we simply can't model reality properly even within our own frameworks :(

lisivka 3505 days ago

In such cases, I use "torture" test cases: lengthy, random test cases that try to abuse and overload the system with no data, incorrect data, random data, huge data, high latencies, duplicated messages, missed messages, random aborts, random spikes, etc. They allow me to discover situations not covered by regular test cases. I also try to use underpowered hardware for such testing. Of course, I cannot imagine all possible torture scenarios, but I have seen a lot of bug and security reports, so I still know a lot of scenarios, more than I am willing to write tests for.
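
A minimal sketch of what such a torture test could look like, in Python. Everything in it is a hypothetical stand-in for illustration, not code from any real system: `Receiver` is a toy component that should deliver each message exactly once and in order, and `torture_run` abuses it with randomly dropped, duplicated and reordered messages. Every run is driven by a seed, so a failing scenario can be replayed.

```python
import random


class Receiver:
    """Toy receiver (hypothetical): delivers each sequence number once, in order."""

    def __init__(self):
        self.next_seq = 0    # lowest sequence number not yet delivered
        self.buffer = {}     # out-of-order messages waiting for their turn
        self.delivered = []  # payloads handed to the application

    def on_message(self, seq, payload):
        if seq < self.next_seq or seq in self.buffer:
            return  # duplicate: already delivered or already buffered
        self.buffer[seq] = payload
        # Deliver every message that is now contiguous with what we have.
        while self.next_seq in self.buffer:
            self.delivered.append(self.buffer.pop(self.next_seq))
            self.next_seq += 1


def torture_run(seed, n_messages=200, drop=0.3, dup=0.3):
    """One randomized scenario with drops, duplicates and reordering."""
    rng = random.Random(seed)  # seeded, so a failing run can be reproduced
    receiver = Receiver()
    rounds = 0
    while receiver.next_seq < n_messages:
        rounds += 1
        assert rounds < 10_000, f"no progress for seed {seed}"
        in_flight = []
        # The sender retransmits everything not yet known to be delivered.
        for seq in range(receiver.next_seq, n_messages):
            if rng.random() < drop:
                continue               # missed message
            in_flight.append(seq)
            if rng.random() < dup:
                in_flight.append(seq)  # duplicated message
        rng.shuffle(in_flight)         # arbitrary reordering
        for seq in in_flight:
            receiver.on_message(seq, f"msg-{seq}")
    return receiver.delivered


def torture(n_seeds, n_messages=200):
    """Run many seeds and check the invariant after each one."""
    expected = [f"msg-{i}" for i in range(n_messages)]
    for seed in range(n_seeds):
        delivered = torture_run(seed, n_messages)
        assert delivered == expected, f"invariant broken for seed {seed}"
    return n_seeds
```

Calling `torture(1000)` runs a thousand random scenarios; when an assertion fires, the seed in the message is enough to replay exactly that run.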

bitL 3505 days ago

That's a good approach as well! In addition, nowadays with VMs/containers you can even simulate nodes going randomly up and down, which is a bit of a challenge if you do it in a real testing cluster.
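
The same idea can be tried in-process before involving real VMs or containers. Below is a rough Python sketch under loud assumptions: `Node` and `churn_run` are made-up names, the "cluster" is just a list of objects, and a restarted node recovers by copying the full committed history, which stands in for whatever recovery protocol a real system would use. Nodes are flipped up and down at random while writes keep arriving, and the invariant is that every live node holds exactly the committed history.

```python
import random


class Node:
    """Hypothetical replica: holds a log and can be up or down."""

    def __init__(self, name):
        self.name = name
        self.up = True
        self.log = []


def churn_run(seed, n_nodes=5, steps=500, flip=0.1):
    """Apply writes while nodes randomly go down and come back up."""
    rng = random.Random(seed)  # seeded for reproducibility
    nodes = [Node(f"n{i}") for i in range(n_nodes)]
    committed = []  # reference history the replicas must agree with
    for step in range(steps):
        if rng.random() < flip:
            node = rng.choice(nodes)
            node.up = not node.up
            if node.up:
                # Assumed recovery: a restarted node copies the full history.
                node.log = list(committed)
        live = [n for n in nodes if n.up]
        if not live:
            continue  # total outage: this write is rejected
        committed.append(step)
        for n in live:
            n.log.append(step)
        # Invariant: every live node has exactly the committed history.
        assert all(n.log == committed for n in live), f"diverged, seed {seed}"
    # Heal the cluster: bring every node back and let it catch up.
    for n in nodes:
        if not n.up:
            n.up = True
            n.log = list(committed)
    return nodes, committed
```

Replacing the copy-on-restart step with a real catch-up mechanism is where a sketch like this gets interesting, because that is where out-of-sync data after a restart would show up.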
