| We've had a couple of cases of flaky tests failing builds over the last two years at my company. Most often it's browser / end-to-end type tests (e.g. selenium-style tests) that are the most flaky. Many of them only fail in 1-3% of cases, but if you have enough of them the chances of a failing build is significant. If you have entire builds that are flaky, you end up training developers to just click "rebuild" the first one or two times a build fails, which can drastically increase the time before realizing the build is actually broken. An important realization is that unit testing is not a good tool for testing flakyness of your main code - it is simply not a reliable indicator of failing code. Most of the time it's the test itself that is flaky, and it's not worth your time making every single test 100% reliable. Some things we've implemented that helps a lot: 1. Have a system to reproduce the random failures. It took about a day to build tooling that can run say 100 instances of any test suite in parallel in CircleCI, and record the failure rate of individual tests. 2. If a test has a failure rate of > 10%, it indicates an issue in that test that should be fixed. By fixing these tests, we've found a couple of techniques to increase overall robustness of our tests. 3. If a test has a failure rate of < 3%, it is likely not worth your time fixing it. For these, we retry each failing test up to three times. Not all test frameworks support retying out of the box, but you can usually find a workaround. The retries can be restricted to specific tests or classes of tests if needed (e.g. only retry browser-based tests). |
How do you know? What you say is plausible, but it's also plausible that these rarely-failing tests also rarely-fail in production, and occasionally break things badly and cause outages or make customers think of your software as flaky.
Since you say this, I presume you've spent the time to actually track down the root causes of several tests that fail < 3% of the time? If so, what did you find? Some sort of issues with the test framework, or issues with your own code that you're confident would only ever be exposed by testing, or something else? I'm very curious.