Hacker News
by teknopaul 1299 days ago
Can you explain how making flakey tests not flakey helps find bugs? I would have thought these differences are essentially free fuzzing, and desirable.
3 comments

Sure! I think there's a really subtle point underpinning your question, and the answer lies in the different purposes of regression testing and bug finding. In regression testing (CI), you're testing whether the code change introduced new problems. At that point you don't really want to learn that someone else's test, downstream from your component, fails when given a new thread schedule that it has not previously seen. Whereas if you're stress testing (including fuzzing and concurrency testing), you probably want to torture the program overnight to see if you can turn up new failures.

The Coyote project at Microsoft is a concurrency testing project with some similarities to Hermit. For the reasons above, they say in their docs to use a constant seed for CI regression testing, but use random exploration for bug finding:

  https://www.microsoft.com/en-us/research/project/coyote/

Still, it does feel like wasted resources to test the same points in the (exponentially large) schedule space again and again. It's a kind of exploration/exploitation tradeoff.

We don't do it yet, but I would consider running a randomized exploration during CI while keeping the fixed-seed run as the observable result. If the randomized run fails, send it over to the "bug finding" component for further study, and quickly retry with the known-good seed to produce the CI-visible regression test results.
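
To make that policy concrete, here is a rough Python sketch. It is not something Hermit does today, and every name in it (`ci_run`, `run_test`, `known_good_seed`, `CiResult`) is hypothetical; `run_test` stands in for "run the test deterministically under this scheduling seed and return whether it passed".

```python
import random
from typing import Callable, NamedTuple, Optional


class CiResult(NamedTuple):
    passed: bool                    # what the CI dashboard reports
    seed_for_triage: Optional[int]  # failing random seed to hand to bug finding


def ci_run(run_test: Callable[[int], bool],
           known_good_seed: int,
           rng: random.Random) -> CiResult:
    """Explore one random schedule, but fall back to the fixed seed on failure."""
    random_seed = rng.randrange(2**32)
    if run_test(random_seed):
        # The freshly explored schedule passed; nothing to triage.
        return CiResult(passed=True, seed_for_triage=None)
    # The random schedule failed: keep its seed for later study, then retry
    # with the known-good seed so CI reports only regressions on that schedule.
    return CiResult(passed=run_test(known_good_seed),
                    seed_for_triage=random_seed)
```

Note the sketch only runs the known-good seed when the random one fails, which is how I described it above. Always running both would cost more but would make the CI result depend on the fixed seed alone.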

I don't think there's one right policy here. But having control over these knobs lets us be intentional about it.

P.S. Taking the random schedules the OS gives us is kind of "free fuzzing", but it is very BAD free fuzzing. It over-samples the probable, boring schedules and under-samples the more extreme corner cases. Hence concurrency bugs lurk until the machine is under load in production and edge cases emerge.
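
A toy illustration of the over-sampling point, as a sketch only: two simulated threads do a non-atomic `counter += 1`, and the update is lost only if the first thread is preempted between its read and its write. The 1% preemption probability is a made-up number standing in for "the OS rarely switches inside a tiny window"; it is not a measurement of any real scheduler, and none of this is Hermit code.

```python
import random


def lost_update(seed: int, preempt_prob: float) -> bool:
    """Simulate two threads doing a non-atomic increment.

    Thread A reads the counter, then is preempted with probability
    `preempt_prob` before it writes back. Returns True if an increment
    was lost (final counter is 1 instead of 2).
    """
    rng = random.Random(seed)
    counter = 0
    a_read = counter                  # A: read
    if rng.random() < preempt_prob:   # scheduler switches to B mid-update
        counter = counter + 1         # B: read + write
        counter = a_read + 1          # A: writes a stale value, B's update is lost
    else:
        counter = a_read + 1          # A: write
        counter = counter + 1         # B: read + write
    return counter != 2


def count_failures(preempt_prob: float, runs: int = 1000) -> int:
    """Count lost updates over seeds 0..runs-1."""
    return sum(lost_update(seed, preempt_prob) for seed in range(runs))


os_like = count_failures(preempt_prob=0.01)  # preemption in the window is rare
chaotic = count_failures(preempt_prob=0.5)   # deliberately aggressive preemption
print(f"lost updates in 1000 runs: os_like={os_like}, chaotic={chaotic}")
```

The bug is equally real in both configurations; the only thing that changes is how often the sampling strategy lands on the interleaving that exposes it.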

Once we have complete control over the determinism of a test, we can start to play with tweaking the non-deterministic inputs in a controlled way. For example, we can tweak the RNG seed used for thread scheduling to explore schedules that wouldn't normally happen under the Linux scheduler.

How do you know if a flakey test has been fixed? A deterministic environment can turn a flakey test into a repeatable failure, which can then be known to be fixed.

Well, we don't prove the absence of concurrency bugs; that would be more a job for formal verification or type systems at the source level.

But we can tell when our `--chaos` stress tests cease to produce crashes in reasonable numbers of runs. And when we do achieve a crash we can use our analysis phase to identify the racing operations.
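
As a rough sketch of that workflow (a toy simulation in Python, not Hermit's implementation or API; all names here are hypothetical): each "thread" is a list of steps, a seeded RNG picks which thread runs next, and we sweep seeds to collect the ones that fail. Because a seed fully determines the schedule, a failing seed can be replayed against a candidate fix.

```python
import random
from typing import Callable, Iterable, List

Step = Callable[[dict], None]


def run_interleaved(threads: List[List[Step]], seed: int) -> dict:
    """Run every thread's steps in an order chosen by a seeded RNG."""
    rng = random.Random(seed)
    state = {"counter": 0}
    pending = [list(steps) for steps in threads]
    while any(pending):
        tid = rng.choice([i for i, steps in enumerate(pending) if steps])
        pending[tid].pop(0)(state)
    return state


def racy_increment(name: str) -> List[Step]:
    """Buggy version: read and write are separate, interruptible steps."""
    def read(state: dict) -> None:
        state[name] = state["counter"]

    def write(state: dict) -> None:
        state["counter"] = state[name] + 1

    return [read, write]


def atomic_increment(name: str) -> List[Step]:
    """Fixed version: the increment is a single indivisible step."""
    def incr(state: dict) -> None:
        state["counter"] += 1

    return [incr]


def failing_seeds(make_thread: Callable[[str], List[Step]],
                  seeds: Iterable[int]) -> List[int]:
    """Return the seeds whose schedule loses an increment."""
    failures = []
    for seed in seeds:
        state = run_interleaved([make_thread("a"), make_thread("b")], seed)
        if state["counter"] != 2:
            failures.append(seed)
    return failures


buggy_failures = failing_seeds(racy_increment, range(200))
# Replay exactly the schedules that failed, now against the fixed code.
still_failing = failing_seeds(atomic_increment, buggy_failures)
print(f"{len(buggy_failures)} failing seeds before the fix, "
      f"{len(still_failing)} after")
```

An empty replay result says the fix holds on the schedules that used to fail; it says nothing about schedules that were never explored, which is why the sweep over fresh seeds still matters.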

It's both a pro and a con of the approach that we work with real crashes/failures. This means it's a less sensitive instrument than tools like TSAN (which can detect data races that never cause a failure in an actual run), but conversely we don't have to worry about false positives, because we can present evidence that a particular order of events definitely causes a failure. Also, we catch a much more general category of concurrency bugs (ordering problems between arbitrary instructions/syscalls, even between processes and in multiple languages).