Hacker News
by matharmin 2577 days ago
We've had a couple of cases of flaky tests failing builds over the last two years at my company. Most often it's browser / end-to-end type tests (e.g. selenium-style tests) that are the most flaky. Many of them only fail in 1-3% of cases, but if you have enough of them the chance of a failing build is significant.

If you have entire builds that are flaky, you end up training developers to just click "rebuild" the first one or two times a build fails, which can drastically increase the time before realizing the build is actually broken.

An important realization is that unit testing is not a good tool for testing flakiness of your main code - a flaky failure is simply not a reliable indicator of failing code. Most of the time it's the test itself that is flaky, and it's not worth your time making every single test 100% reliable.

Some things we've implemented that help a lot:

1. Have a system to reproduce the random failures. It took about a day to build tooling that can run say 100 instances of any test suite in parallel in CircleCI, and record the failure rate of individual tests.

2. If a test has a failure rate of > 10%, it indicates an issue in that test that should be fixed. By fixing these tests, we've found a couple of techniques to increase overall robustness of our tests.

3. If a test has a failure rate of < 3%, it is likely not worth your time fixing it. For these, we retry each failing test up to three times. Not all test frameworks support retrying out of the box, but you can usually find a workaround. The retries can be restricted to specific tests or classes of tests if needed (e.g. only retry browser-based tests).
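
A minimal sketch of points 1 and 3, not the tooling described above: it runs a test repeatedly in a single process rather than as parallel CircleCI jobs, and the names `retry`, `measure_failure_rate` and `make_flaky_test` are made up for illustration. It also assumes "up to three times" means three attempts in total.

```python
import functools
import random


def retry(times=3, exceptions=(AssertionError,)):
    """Re-run a failing test function, allowing `times` attempts in total."""
    def decorator(test_func):
        @functools.wraps(test_func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, times + 1):
                try:
                    return test_func(*args, **kwargs)
                except exceptions:
                    if attempt == times:
                        raise  # out of attempts: report the real failure
        return wrapper
    return decorator


def measure_failure_rate(test_func, runs=100):
    """Run `test_func` `runs` times and return the fraction of failing runs."""
    failures = 0
    for _ in range(runs):
        try:
            test_func()
        except AssertionError:
            failures += 1
    return failures / runs


def make_flaky_test(failure_probability, rng):
    """Build a stand-in test that fails at random with the given probability."""
    def flaky_test():
        assert rng.random() >= failure_probability, "simulated flaky failure"
    return flaky_test


if __name__ == "__main__":
    rng = random.Random(0)
    flaky = make_flaky_test(0.03, rng)
    print("raw failure rate:", measure_failure_rate(flaky, runs=1000))
    print("with retries:", measure_failure_rate(retry(times=3)(flaky), runs=1000))
```

Applying `retry` only to selected test functions is one way to restrict retries to, say, browser-based tests.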

5 comments

> If a test has a failure rate of < 3%, it is likely not worth your time fixing it.

How do you know? What you say is plausible, but it's also plausible that these rarely-failing tests also rarely-fail in production, and occasionally break things badly and cause outages or make customers think of your software as flaky.

Since you say this, I presume you've spent the time to actually track down the root causes of several tests that fail < 3% of the time? If so, what did you find? Some sort of issues with the test framework, or issues with your own code that you're confident would only ever be exposed by testing, or something else? I'm very curious.

It's possible, but after fixing lots of these, my experience says we're usually talking about stuff like clicking a button before a modal animates out of the way.

It's sort of a "bug" in that yes, clicking here and then here 1ms later doesn't do the best thing, but it's basically irrelevant.

Testing is inherently a probabilistic endeavor.

"What can I do that is most likely to prevent the largest amount of bugginess?"

Fixing tests that rarely fail is -- in my experience -- a poor answer to such a question.

> Testing is inherently a probabilistic endeavor.

That's a pretty powerful insight!

I think that a lot of developers who are firmly in the test-driven camp don't realize this, but instead think that if you have 100% test coverage, your code will work 100% of the time. Fixing bugs, to them, is "just" an inevitable result of increasing your test coverage, so that's what they focus on.

My point here is that even if it may be because of flaky code, general unit and integration tests are the wrong tools to test for flaky code. The only exception I have encountered here is if you have code that is written to specifically handle concurrent situations, and your test is focussing specifically on testing the concurrency part.

The most common places these flaky tests occur are with integration/browser-based tests, where there are multiple layers of tools that each fail a small percentage of the time.

Unit tests also sometimes fail because of not cleaning up state properly, which only breaks things when tests run in a very specific order. Or sometimes the tests make subtle assumptions about database ordering that are only valid 99% of the time.
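
To make those two failure modes concrete, here is a small sketch using Python's `unittest`. The `_registry` module state and the `register` function are hypothetical, and the hard-coded `rows` list stands in for a query result that has no guaranteed order.

```python
import unittest

# Hypothetical module-level state shared by the code under test.
_registry = []


def register(name):
    """Add a name to the shared registry and return its 1-based position."""
    _registry.append(name)
    return len(_registry)


class RegistryTests(unittest.TestCase):
    def setUp(self):
        # Reset shared state before every test so the outcome does not
        # depend on which tests happened to run earlier.
        _registry.clear()

    def test_first_registration_gets_id_one(self):
        self.assertEqual(register("a"), 1)

    def test_two_registrations(self):
        register("a")
        self.assertEqual(register("b"), 2)


class OrderingTests(unittest.TestCase):
    def test_rows_compared_without_assuming_order(self):
        # Stand-in for rows returned by a query with no ORDER BY clause.
        rows = [("bob", 2), ("alice", 1)]
        # Sort before comparing instead of relying on the returned order.
        self.assertEqual(sorted(rows), [("alice", 1), ("bob", 2)])
```

Without the `setUp` reset, `test_first_registration_gets_id_one` would pass or fail depending on whether another test had already touched `_registry`.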

> Most of the time it's the test itself that is flaky

I have always understood that unit tests must inherently be deterministic for the reason you explain.

A small test that is not deterministic is testing something other than "the unit" since there is another independent variable unaccounted for, often the state of the database or the configuration of a test environment.

Not that unit tests are perfect. Unit testing a concurrent data structure without threads (which are inherently nondeterministic) is not especially useful.

I see there as being a tension between determinism and mocking. Classic TDD dogma says to mock super close to the unit under test, so that the only logic in play is the logic within that unit. Which is all well and good, but there's lots of code out there where the stuff that breaks is the stuff on the interfaces; once you mock that out, you've removed a significant chunk of what might legitimately break, and therefore diminished the value of the test.

So it's a balance. Sometimes it really is worth it to just attack that one function with its weird snarl of if statements and initial conditions— totally. But there are other cases where part of what you want is to inspect what happens in the adjacent object, on a different thread, as a result of stimulating something under test conditions. This isn't wrong, and these kinds of tests can be really hard to get completely deterministic, especially if the CI environment is some heavily-loaded VM host with totally different thread switching characteristics from your laptop.

I have come to conclude that excessive mocks are a symptom of poor architecture.

Classic TDD as you describe (see the other reply, classic TDD is different) works great for algorithms: take some data, manipulate it, and get different data out. There is no need for mocks. This is where your business logic should be, and it is easy to test.

However this fails in the real world because algorithms are but a minority of code: most code in my experience is just moving data around from subsystem to subsystem, and external collaborators. Here you do have collaborators and the interactions are the point. Mocks now start to make sense because the point is that my subsystem delivers data to that something else, and I shouldn't know or care what that something else is.

I've seen the above fail in several ways. I've seen people mock their algorithm from the communication, but in practice the communication and the algorithm are tightly coupled anyway so changes in one will change the other.

Worse, I see many people test not the subsystem boundaries, but boundaries within the subsystem. That is they start writing the subsystem, and then realize (correctly) that they need to break the subsystem up, then they test the subsystem as it is broken down. This seems good, but it leads to brittle systems that cannot be changed because the sub-subsystem is now not allowed to change because it would break tests.

To understand this, remember, a test is an assertion that something will not change. Thus if you mock a collaborator you are asserting that the collaborator is a different subsystem and you are not allowed to refactor across this boundary. If the boundary is not an architecture boundary you shouldn't mock it because you might want to change it.
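
A rough illustration of that split, with invented names (`apply_discount`, `CheckoutService`, `payment_gateway`): the algorithm is tested with plain data and no mocks, and the only thing mocked is the collaborator on the far side of what is assumed to be a real architectural boundary.

```python
from unittest import mock


def apply_discount(total_cents, percent):
    """Pure business logic: data in, data out, so no mocks are needed."""
    return total_cents - (total_cents * percent) // 100


class CheckoutService:
    """Subsystem that hands its result to an external collaborator."""

    def __init__(self, payment_gateway):
        # `payment_gateway` is a hypothetical collaborator that lives on the
        # other side of an architectural boundary.
        self._gateway = payment_gateway

    def checkout(self, total_cents, percent):
        amount = apply_discount(total_cents, percent)
        self._gateway.charge(amount)
        return amount


def test_apply_discount():
    # The algorithm is exercised directly with real values.
    assert apply_discount(1000, 10) == 900


def test_checkout_charges_discounted_amount():
    # Only the boundary collaborator is replaced; apply_discount runs for real,
    # so it stays free to be refactored inside the subsystem.
    gateway = mock.Mock()
    service = CheckoutService(gateway)
    assert service.checkout(1000, 10) == 900
    gateway.charge.assert_called_once_with(900)
```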

> Classic TDD dogma says to mock super close to the unit under test, so that the only logic in play is the logic within that unit.

I suppose it depends on your definition of "classic TDD dogma". Mocking really wasn't a thing until TDD had been around for about 5-10 years, so super classic TDD dogma has always been "don't mock" ;-)

"London School", GOOS, Outside-in approach has always been to mock heavily. I call it "wish based programming". You write a test, wishing that you had some facility and since you don't have it, you mock it. Then once the test is in place, you can write your code and eventually write production code that represents the mock (and personally, I remove the mock at that point).

It was really after that, as far as I can tell, that people started to get the idea that you should mock all your collaborators in order to isolate your units. This kind of isolation was never a thing originally (see Kent Beck's original book on the subject). If you watch DHH's conversations with Kent Beck (and I think Martin Fowler???) on the topic, they state pretty clearly that "Chicago School" is to avoid mocking except as a last resort (my own personal preference as well). Also take a look at Michael Feathers' discussion in his Legacy Code book for a good description of what the original ideas of fakes, stubs and mocks were. These days those definitions are practically lost.

I'm not sure why there has been this idea that mocking was always a part of TDD, but it definitely is a popular notion.

Mocking a module's dependencies decouples the module from the dependency modules. To me, that's the payoff of mocking. And mocks only really click for me in the wider scope of "Outside-In"-style TDD.

It's the black box/functional/integration test that exercises the production code from the standpoint of the enduser that proves whether a tested module's dependencies actually satisfy their contracts. Also, the functional tests are the only place that you can discover if DevOps is needed as well prior to deploying into a real test/prod environment. Plus, the functional test captures the user story that we're focused on in a way unit tests cannot, so the functional tests direct the overall work.

I must have those functional tests in place before I do my unit tests. Otherwise, those mocks really are creating a wish-based programming system.

I agree with you that mocking of today is indistinguishable from what Michael Feathers described in his Legacy Code book (which is excellent regardless, BTW.) Mocking today is so easy to express and change and grok with tools like Spock Framework.

Interesting, thanks for the history lesson— I feel a bit better about my own stance, which is also largely to mock as a last resort.

Although I've never been a ruby programmer, you're right that I'm influenced by DHH and the ruby community's approach on a lot of these things.

I wasn't arguing against less deterministic tests. I was just saying "unit test" isn't the name for them. Call them "small tests" or "smoke tests" or make up a new term.

Not all tests are unit tests. I had a property test that I eventually just turned off because it worked fine on everyone's machine but failed 60% of the time on Travis due to timeout issues. It got worse, up from 30%, after Travis was sold; I suspect they are skimping on AWS. I probably should have written a more effect-dependent timeout, but it was hard to justify recoding something when your test is long and your retrigger is via Travis.

I find

> 3. If a test has a failure rate of < 3%, it is likely not worth your time fixing it. For these, we retry each failing test up to three times. Not all test frameworks support retrying out of the box, but you can usually find a workaround. The retries can be restricted to specific tests or classes of tests if needed (e.g. only retry browser-based tests).

to be pretty terrifying. I know that folks are under different amounts of pressure but we'd reject that code from merging here (or revert it out when we discovered the flakiness) as it's basically just a half-finished test that requires constant babysitting.

I'm not sure how you'd reject that flakey test. We're talking <3% so first let's assume that you don't even see the failure until 10 other PRs get merged. Not only do you not know what caused the failure, it could be that the failure is in a test which was in the code for ages but the new code breaks its assumptions / initial environment.

Sometimes you can't just point at one thing and say reject this or revert that without a long investigation.

If the test failure is detected (so, you get super lucky) you should immediately reject the code, including the tests failing before some other fixups that didn't affect that test... Oftentimes it will take a long time to surface these, but I'm of the opinion that a broken build is show stopping until it's resolved - that doesn't mean 5-alarm all devs rush to the scene, but it does mean a free person picking it up as their next task or, barring that, bumping someone off of feature or other work to address the issue.

It may take a while... but while that flakey test exists in your codebase it will levy a constant cost on all of your developers.

> you should immediately reject the code

Which code? When you randomly run into the flakey test, in most cases it's not coming from the change which was just tested. You'd reject some random, unrelated PR

I'm jealous of your workplace's attitude and latitude towards testing

We're the inheritors of a legacy code base. Part of this involved taking a strong stance to go from zero to hero in terms of testing: no minor bugs are fixed in areas of code not covered with automated testing. This has made our feature work slow right down, but we are lucky to have management's support in paying this cost now rather than paying interest on it as time progresses.

> Most of the time it's the test itself that is flaky

I recently went through a heavy de-flaking on a suite of Selenium tests. I found this comment to be true in my case; it was reasonable-seeming assumptions in the tests that caused flakiness more often than anything else. The second most common cause was timing or networking issues with the Selenium farm.

Spending the time actually de-flaking the tests was quite enlightening and led to some new best practices for both writing tests, and for spinning up Selenium instances.

Because of that experience, I'm not sure I would agree with giving up on tests that fail less than 3% of the time, because fixing one of those cases can sometimes fix all of them. Learning the root causes of test failures led me to implement some fixes that increased the stability of the entire test suite. Sometimes there's only one problem but it causes all the tests in a group to fail one at a time, infrequently and seemingly randomly.

I'm having similar issues with Selenium tests at the moment. Can you give any insights into deflaking tests failing on timing or networking issues (I'm guessing it's a case by case basis, but generic tips might work for some problems)? This is by far the biggest pair of reasons for the suite failing for us, so any help would be really...er...helpful! :D

For me, the main class of flaky Selenium tests was tests expecting CSS-related changes to appear on-screen in the same frame or one frame later, when sometimes they take more than one frame. The main result was to rarely/never execute multiple actions in a row without waiting; the tests changed considerably into pairs of [ action -> wait for specific CSS change on screen ] throughout. Also I did have some sleep calls for a few hard to test things, even though I knew it was a bad practice, and I spent time eliminating all of them and figuring out the fundamental reason it was hard to measure and/or adding something to the front end code to signal via CSS that a long running operation was complete.
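
The [ action -> wait ] pairing might look roughly like the sketch below. It deliberately avoids a real browser: `FakePage` is an invented stand-in whose "CSS change" lands a little after the click, and `wait_until` is a hand-rolled polling helper. With Selenium's Python bindings, `WebDriverWait(driver, timeout).until(...)` plays the role of `wait_until`.

```python
import time


def wait_until(condition, timeout=5.0, interval=0.05):
    """Poll `condition` until it returns a truthy value or `timeout` elapses."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)


class FakePage:
    """Stand-in for a browser page whose CSS change lands a little late."""

    def __init__(self, delay):
        self._delay = delay
        self._clicked_at = None

    def click_close(self):
        self._clicked_at = time.monotonic()

    def modal_hidden(self):
        if self._clicked_at is None:
            return False
        return time.monotonic() - self._clicked_at >= self._delay


def close_modal(page):
    # One [action -> wait] pair: do not start the next action until the
    # visible effect of this one has been observed.
    page.click_close()
    wait_until(page.modal_hidden, timeout=2.0)
```

Unlike a fixed `sleep`, the wait returns as soon as the condition holds and fails loudly when it never does.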

On the setup side, the main networking issue for me was ssh. This might be different for you if you use either a Selenium cloud service, or a proper VPN setup. I was spinning up a virtual network on AWS for a Selenium farm in a fairly ad-hoc and manual way using ssh. The tests were sometimes (only very occasionally) starting before the ssh connection was actually ready, so I had to put a delay in the setup script to send & wait until a packet had actually been received before launching the test. I used netcat for that.
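
The comment above used netcat for this. As a rough equivalent in Python, a setup script could block until a TCP connection to the forwarded port actually succeeds. This is only an approximation of "wait until a packet has been received", and `wait_for_port` is an illustrative name, not an existing tool.

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Return True once a TCP connection to (host, port) succeeds.

    Returns False if no connection could be made before `timeout` elapsed.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening and reachable.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Refused, unreachable, or timed out: wait and try again.
            time.sleep(interval)
    return False
```

A launcher would call this with whatever host and port the tunnel exposes and only start the test run once it returns `True`.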

My approach has been to write a layer into our testing system that retries `find_element_by_*` for n seconds on failure (5 works for most of our tests) before reporting problems. I mean to eventually generalize and open-source them (I will try to contribute back, but since I only do this for Python, it might be rejected).
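
A layer like that could be sketched as a decorator around any lookup function. This is not the commenter's code: `retry_for` and `ElementNotFound` are hypothetical names, and in a real suite the exception tuple would hold whatever error the browser driver raises when an element is missing.

```python
import functools
import time


class ElementNotFound(Exception):
    """Hypothetical stand-in for the lookup error a browser driver raises."""


def retry_for(seconds=5.0, interval=0.1, exceptions=(ElementNotFound,)):
    """Retry the wrapped lookup until it succeeds or `seconds` have elapsed."""
    def decorator(lookup):
        @functools.wraps(lookup)
        def wrapper(*args, **kwargs):
            deadline = time.monotonic() + seconds
            while True:
                try:
                    return lookup(*args, **kwargs)
                except exceptions:
                    if time.monotonic() >= deadline:
                        raise  # still failing after the time budget
                    time.sleep(interval)
        return wrapper
    return decorator
```

Because the last exception is re-raised unchanged, a genuinely missing element still fails the test, just a few seconds later.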

Just having this means a lot less time finding those problems for me.

I fix flaky tests, and a 3% failure rate would drive me crazy. But I don't automatically rerun tests. I was skeptical of the idea that rerunning tests would help, so I did a bit of math:

A test suite with 1 test that fails 3% of the time will succeed 97% of the time. (1-.03)

A test suite with 10 tests that fail 3% of the time will succeed 74% of the time. (.97^10)

100 flaky tests? Now half your test runs fail. (.97^100)

You're retrying three times? Now your test suite is slow, but you can have up to 2,000 flaky tests before it starts becoming a real problem. Or 60,000 if you retry four times. (1-.03^3)^2000 = 94.7%; (1-.03^4)^60000 = 95.3%.

My conclusion: rerunning flaky tests is a legit way of solving the problem, as long as your tests aren't too slow. Still makes my skin itch, though. Fixing flaky tests forces me to face design flaws in my code.

(The math, in case I did it wrong: .03^3 = f = chance of a 3% failure test failing three runs in a row. 1-f = s = chance of test succeeding. s^1000 = chance of test run with 1000 flaky tests succeeding.)

Your math is mostly correct, except that for 100 flaky tests your test run will only pass in 4.8% of cases.

That's why it's very important to retry individual tests, and not the entire test run.

Oops, you're right. I was moving too quick and misread .04755 (4.8%) as .4755 (48%).
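
The figures in this sub-thread can be reproduced with a few lines of Python. The sketch assumes, as the comments do, that every flaky test fails independently with the same probability; the function names are made up here. It also compares retrying individual tests against rerunning the whole suite.

```python
def suite_pass_rate(num_tests, failure_rate, attempts=1):
    """Chance that every test passes when each test gets `attempts` tries.

    Assumes failures are independent and equally likely for every test.
    """
    still_failing = failure_rate ** attempts  # a test fails on every attempt
    return (1 - still_failing) ** num_tests


def whole_run_retry_pass_rate(num_tests, failure_rate, attempts):
    """Chance of at least one green run when the entire suite is rerun."""
    single_run = suite_pass_rate(num_tests, failure_rate)
    return 1 - (1 - single_run) ** attempts


if __name__ == "__main__":
    for n in (1, 10, 100):
        print(f"{n} flaky tests, no retries: {suite_pass_rate(n, 0.03):.1%}")
    print(f"2000 tests, 3 attempts each: {suite_pass_rate(2000, 0.03, 3):.1%}")
    print(f"60000 tests, 4 attempts each: {suite_pass_rate(60000, 0.03, 4):.1%}")
    print(f"100 tests, 3 attempts per test: {suite_pass_rate(100, 0.03, 3):.1%}")
    print(f"100 tests, 3 whole-suite runs: {whole_run_retry_pass_rate(100, 0.03, 3):.1%}")
```

With 100 flaky tests, three attempts per test gives a pass rate above 99%, while three reruns of the whole suite still fail most of the time.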