by _yosefk 3287 days ago
Wow, I only learned about rr today. An amazing project! Google, for instance, reported a sizable number of "flaky tests" which tend to pass but sometimes fail. Always running tests under rr would take care of that (since each failure would be reproducible.)

This is a huge deal. While I prefer the Cilk approach (automated debugging pinpointing places which can theoretically execute non-deterministically), it's not always applicable and isn't always or even often applied where applicable. This is definitely the next best thing, and in absolute terms, it's pretty damn good.

IME the problem with applying rr to a flaky test is that it's not uncommon that rr will just reliably and deterministically cause the test to not fail. This is especially true for failures due to multithreading bugs and similar race conditions, where they can easily just not manifest under rr's deterministic thread scheduler. So if you always run your tests under rr your flakiness problem may go away but the underlying bug causing it hasn't...

rr's pretty great for a lot of purposes though -- I think the clearest use case for it is the "memory corruption happens but doesn't get noticed until execution has progressed a long way forward from the actual site of the bug" kind of problem, which it can speed up debugging of massively.

Try lots of runs under chaos mode: http://robert.ocallahan.org/2016/02/introducing-rr-chaos-mod...

If you figure out the cause of a bug that rr chaos mode was not able to reproduce, please file an rr issue and explain the situation. We've generally been able to improve chaos mode to find hitherto non-reproducible bugs.

" Always running tests under rr would take care of that (since each failure would be reproducible.)"

That's not actually the hard part :)

It honestly really wouldn't help that much, even if it weren't infeasible for other reasons (increased resource usage, even at their cited 1.2x, is a ton, etc.).

Even if you could completely reliably restore the state on a machine other than the one it ran on (i.e. possibly a different arch/memory/etc. configuration), and you met all the requirements, the main part rr helps you with is reproducing the failure, which, most of the time, google could actually do anyway.

The hard part is figuring out what went wrong. Remember Google already has asan, ubsan, and tons of other stuff running. So the trivial causes are pretty much not occurring. It's not like people look at the flakes and say, welp, i messed up this one variable and that was that!

It's usually torturous debugging of trying to understand the set of conditions that have occurred.

I.e., the reason there are so many flakes is that there's so much that can go wrong, not that people can't reproduce the failures.

Also note that in Google's world, RR would probably not be compatible with how the tests are run anyway (RR wants fairly exclusive access to the perf counters, but the tests may be getting sampled), and for a set of flakes, perturbing the performance counter settings will change things enough to make them less flaky!

rr helps with indeterminism. Memory sanitizers help only a little with that, and, sadly, any thread sanitizer working with POSIX-like threads also helps only a little with that. Reproducing the problem (somewhat differently every time) is one thing, reproducing it in exactly the same way every time and looking at state for as long as you want to is another ballgame entirely. And a 1.2x slowdown is not a ton. 2x is not a ton.

I've developed automated debugging tools, I debugged hundreds and hundreds of rare, irreproducible bugs or just bugs that are hard to figure out, I found things like memory ordering bugs in others' concurrent code just by eyeballing it because the bug was too rare to wait for it to reproduce, I debugged concurrent code on multicore chips without memory coherence, on faulty hardware, etc. You're entitled to your own opinion about what "the hard part" is, but I'm likewise entitled to mine, and I think rr is awesome and I disagree with you on every point.

OTOH, what rr also brings to the table is reverse debugging, and that makes this kind of torturous debugging an order of magnitude easier.

So not only are you able to reproduce the error, but you're also able to go backwards to find how it happened!

I, for one, barely use gdb anymore because reverse debugging with rr makes debugging so much easier (well, technically, I still do use it, since rr is not an entirely new debugger, you still end up with a gdb prompt).

FWIW, one of the first things I tried when I first used rr was to debug a crash I had debugged years earlier, that was due to a miscompilation by GCC. As that had happened years earlier, I didn't remember the details of what code was miscompiled in what particular way, but I do know that debugging that took a long time, and I had only figured it out by chance because valgrind pointed to related code. With rr it only took minutes to find the root of the crash.

"So not only are you able to reproduce the error, but you're also able to go backwards to find how it happened!"

For some (and i'd guess a bunch of the people i mentioned in the parent comment), they definitely find this easier, but just to present a contrarian position: i actually don't. I admit to being weird - in a former life, i was a gdb maintainer.

I also was trained by people who believed the right approach was not to immediately try to find the sets of conditions and variables that caused your problem and declare victory, but to go and meditate upon the code and think about it until you understood it well enough to understand why this might happen even when you think it couldn't. That will often enable you to understand the code well enough to see what else is wrong.

(Again, i don't claim it's better, i just claim that's why i tend not to care about RR. The hard part for me is the thinking about the code, not the finding the sets of conditions and variables that caused a particular set of errors)

It's definitely the case that, personally, when i follow the "find conditions, fix bugs" approach, i tend to write much buggier code (even with good testing strategies) than when i follow the other way.

This reminds me of Linus Torvalds' distaste for debuggers.

I think eschewing debugging is fine for code you understand pretty well and when you already have significant information about the failure. But when those conditions aren't met, debuggers are very useful. (NB, if you use logging code and think "I'm not using a debugger!", you're just using a bad one.)

It's true that a good debugger tempts one to think less deeply about the code than one should, but that temptation can be overcome.

I wonder if there's a 'hello world' for rr - something funky that is made simple by being able to step in reverse. It's definitely very useful in Visual Studio.

Any heap memory corruption bug is a great example for rr. Replay the failure, spot the corrupt location, set a data watchpoint on it, "reverse continue", and you'll stop where the corruption happened. The same bug is generally horrible to find using a regular debugger.

The Visual Studio feature you mention is probably IntelliTrace, which is considerably more limited than rr and doesn't handle heap state.

> most of the time, google could actually do anyway

Maybe you can reproduce the failure once in a while, but can you reproduce it on every run, so that when you apply the debugger to a particular run you're sure to see the failure? That's what rr gives you. Furthermore rr lets you debug the same execution over and over again, so stuff like event ordering and object addresses stay the same between debugging sessions.

And as glandium said, on top of that we build reverse execution, which is a real game-changer. Debugging is about tracing effects back to causes, and reverse-execution is what you actually want for that.

FWIW rr has a fair number of users who find it really does help a lot for all kinds of debugging tasks, not just those involving flaky tests. Some of those users are even at Google :-).

> RR wants fairly exclusive [access] to the perf counters

No, it can get by with just one of the general-purpose counters, though two is preferable. Most modern Intel CPUs have at least four GP counters, plus some other dedicated counters for the most commonly used events (instructions, cycles).

> Also note that in Google's world, RR would probably not be compatible with how the tests are run anyway

That may well be true, but I suspect it's more likely to be some other issue, e.g. that the tests run in VMs that don't virtualize the PMU.

It's true that rr perturbs tests, making some bugs difficult to reproduce and making other bugs show up. However, a single failing run captured by rr is almost always enough to figure out the bug. We also have some techniques for randomizing perturbation (e.g. of scheduling) to expose a wider variety of bugs.

chandlerc had a similar initial reaction. I eventually badgered him into trying it, and now he preaches the gospel whenever someone talks about a bug they can't figure out.

Maybe it's the kind of thing that is hard to get until you try it. I'm like glandium -- I almost never use gdb proper anymore, and hard bugs seem to vaporize as soon as I run rr replay. In particular, the idiom of "notice a bad value, set a watchpoint, reverse-continue" completely changed my relationship with my debugger.

For me, debugging flakes is a small part of the value, although maybe that just reflects the kinds of systems I work on.

I do agree there's probably no need to run all tests under rr, at least with the setup we have at Google.