| "Always running tests under rr would take care of that (since each failure would be reproducible)." That's not actually the hard part :) It honestly wouldn't help that much, even if it weren't infeasible for other reasons (increased resource usage, even at their cited 1.2x, is a ton, etc.). Even if you could completely reliably restore the state on a random machine other than the one the test ran on (i.e., one with a possibly different arch/memory/etc. configuration), and you met all the requirements, the main thing rr helps you with is reproducing the failure, which, most of the time, Google could actually do anyway. The hard part is figuring out what went wrong.
Remember, Google already has ASan, UBSan, and tons of other stuff running, so the trivial causes are pretty much not occurring. It's not like people look at the flakes and say, "welp, I messed up this one variable and that was that!" It's usually torturous debugging, trying to understand the set of conditions that occurred. I.e., the reason there are so many flakes is that there's so much that can go wrong, not that people can't reproduce the failures. Also note that in Google's world, rr would probably not be compatible with how the tests are run anyway (rr wants fairly exclusive access to the perf counters, but the tests may be getting sampled), and for some flakes, perturbing the performance counter settings will change things enough to make them less flaky! |
I've developed automated debugging tools. I've debugged hundreds and hundreds of rare, irreproducible bugs, or just bugs that are hard to figure out. I've found things like memory ordering bugs in others' concurrent code just by eyeballing it, because the bug was too rare to wait for it to reproduce. I've debugged concurrent code on multicore chips without memory coherence, on faulty hardware, etc. You're entitled to your own opinion about what "the hard part" is, but I'm likewise entitled to mine, and I think rr is awesome and I disagree with you on every point.