Hacker News new | ask | show | jobs
by joosters 2580 days ago
In an old job, we had a frustrating test that passed well over 99 times in 100. It was shrugged off for a very long time until a developer eventually tracked it down to code that was generating a random SSL key pair. If the first byte of the key was 0, faulty code elsewhere would mishandle the key and the test failed.

Keeping the randomness in the test was the key factor in tracking down this obscure bug. If the test had been made completely deterministic, the test harness would never have discovered the problem. So although repeatable tests are in most cases a good thing, non-determinism can unearth problems. The trick is how to do this without sucking up huge amounts of bug-tracking time...

(Much effort was spent in making the test repeatable during debugging, but of course the crypto code elsewhere was deliberately trying to get as much randomness as it could source...)

5 comments

It doesn't seem to be a strong argument to have non-deterministic tests.

There was the logic that generates the SSL key pair, and there is the faulty logic that consumes it. Based on the description, it seems it's an indication of missing test coverage around the faulty code. If, when the faulty code was written, more time were spent on understanding the assumptions the code has made, then maybe the test wouldn't appear in the first place.

This anecdote, however, does bring up a good point: Don't shrug off intermittently failed tests - Dig in and understand the root cause of it.

I'm not at all surprised that nobody considered the possibility that code might fail if a key starts with zero. It's often hard to identify all the edge cases.

Now that this edge case has been found, of course it should be replaced by deterministic tests that tests the consuming code with different kinds of keys, including one with a leading zero.

So, in short, the test was failing because the code was incorrect, aka the test was working as designed. We've all had flaky tests but the attitude that says "run it again" has always been the wrong one, because there is a problem, yet we're all ok with deferring the issue.
The only other solution is exhaustive property testing. And even that is not workable when concurrency is in play.
Good luck exhaustively testing something with a cryptographic key as input. Non-exhaustive property testing is also pretty cool, though.
There do exist frameworks that allow exhaustive testing of concurrent code. They never really became mainstream though.

https://www.microsoft.com/en-us/research/publication/chess-a...

http://www.1024cores.net/home/relacy-race-detector

> There was the logic that generates the SSL key pair, and there is the faulty logic that consumes it. Based on the description, it seems it's an indication of missing test coverage around the faulty code. If, when the faulty code was written, more time were spent on understanding the assumptions the code has made, then maybe the test wouldn't appear in the first place.

Based on the description of the bug, the bug was in Microsoft's SChannel TLS stack (I know this, because I found it too, and got a workaround into OpenSSL). I don't know about you, but I haven't written a whole lot of comprehensive tests for any TLS stack that I've used, unless I wrote the stack. I'm assuming jooster didn't work for Microsoft, he just worked for a company that released software for some flavor of Windows and used their TLS stack, because it was there.

This definitely fits into the category of nobody is going to test this, because it's not going to occur to anyone to test how the third party TLS library they're using handle public keys encoded without leading zeros, until they've ran into it before. Having a random key generated in a test suite means you've got a chance to see it; if you're lucky (and if you don't just retry everything without knowing why).

Nope, this was on a variety of unix systems, but no Windows code at that point. We used some RSA libraries for crypto primitives and hardware acceleration, with a lot of custom in-house code (BigInts and other math routines). There was no OpenSSL usage. I think the bug turned out to be in our own code, because it was a case of lots of frustration tracking it down, an ‘ah-ha!’ moment, and a quick code check-in fixing the problem there and then. (Yes, we also wrote a new test case!)

But it’s interesting to hear about an identical sounding bug in similar code/routines. I’d say ‘what are the chances?’ but crypto code is always painfully hard.

Non-deterministic testing is related to fuzzing, which is a well-known way to find security bugs. The problem is that it's often too expensive to do all the time.
Have a single source of randomness and record the seed as part of the test run. Then you have repeatable failures combined with finding obscure bugs.
Well, yes. It's much easier to diagnose after the problem has been found :)

The 'gotcha' part in this case was that the SSL keygen code did not just use a random number seed but was grasping for entropy from other sources too (PIDs, perhaps? I can't recall the detail). That unknown made initial attempts to recreate the problem difficult.

Plus the failing test in question was not directly testing the SSL, merely making use of it. There were other SSL-specific tests run separately but they missed this strange corner case (in 'normal' use, it wasn't like 1 in 256 transactions would fail, otherwise that would have been much more obvious and we'd have had spurious-seeming failures all over the place).

That brings up another test pain: You can write lots of specific unit tests for every individual feature your code has, and they can all pass just fine. But when feature A, D and H happen to all be in use at once, you hit a separate problem. Onwards to 100% coverage...

> Well, yes. It's much easier to diagnose after the problem has been found :)

That's kind of unfair, it's pretty much the logical deduction if you think about randomness in your unit tests. I.e., suppose the test fails, then what? You want to reproduce, but really all you know is that the test indicates the possibility of a bug. In order to investigate, you really need to follow the same code path that the test did. How do you do that? Capture the seed.

There are so many other possible factors: timings, interactions with other processes, memory allocations, network conditions, etc. A random seed is just one of several non-deterministic factors that could have been the cause.
Fair enough, though interactions with other processes and network conditions are not relevant for unit tests. Timings maybe, but then you still need a strategy to turn a failing test into information you can use to fix a bug.
I have horrible memories of time spent tracking down other test failures caused by helper programs not shutting down cleanly, and causing obscure bugs in later tests. e.g. a test failing because it couldn’t bind to port 443, because an earlier test helper program sometimes didn’t clean up properly. That’s just one possibility for a process and/or network condition that ruins a test.
I'm a bit confused, that seems like the test did it's job, it found the faulty code elsewhere. (unless by elsewhere you mean in the test code) It just appeared to be flaky and was treated as such until someone looked into it.
I would say that your non-deterministic components here weren't a good thing - instead the test was poorly written and didn't cover the assumptions of the code under test well. The fact that this bug was revealed by a test is useful, but I should hope that now the tests have cemented that case in a regression suite in a deterministic manner.
> the test was poorly written and didn't cover the assumptions of the code under test well

That's a cop out, you rarely know all the assumptions of all the code. The point of tests is to hopefully suss out those unknowns. Just saying "you should think harder and write better tests" doesn't result in better code.

I don't disagree that it's hard to know all these shortfalls ahead of time, that's sort of why software has bugs and why all us devs stare in amazement anytime it's suggested "Could you folks just... write less bugs" but when the builds start to fail it should be acknowledged that your tests are incomplete, and it should (ideally) be a high priority task to either revert out the feature or fix the feature to resolve the issue. (There are all sorts of reasons this could be difficult, but I'd urge you to advocate strongly for fixing bugs as a higher priority task than feature work - it ends up saving businesses money)
Sometimes you can store the random seed so you get the best of both worlds. Not with crypto though.