| Every company I've founded or worked for has struggled with flaky tests. Twitter had a comprehensive browser and system test suite that took about an hour to run (and they had a large CI worker cluster). Flaky tests could and did scuttle deploys. It was a never-ending struggle to keep CI green, but most engineers saw de-flaking (not just deleting the test) as a critical task. PhotoStructure has an 8-job GitLab CI pipeline that runs on macOS, Windows, and Linux. Keeping the ~3,000 (and growing) tests passing reliably has proven to be a non-trivial task, and researching why a given task is flaky on one OS versus another has almost invariably led to discovery and hardening of edge and corner conditions. It seems that TFA only touched on set ordering, incomplete db resets and time issues. There are many other spectres to fight as soon as you deal with multi-process systems on multiple OSes, including file system case sensitivity, incomplete file system resets, fork behavior and child process management, and network and stream management. There are several aspects I added to stabilize CI, including robust shutdown and child process management systems. I can't say I would have prioritized those things if I didn't have tests, but now that I have it, I'm glad they're there. |
I’m founder at Tesults (https://www.tesults.com) where we have a flaky test indicator that makes identifying these tests easier. It’s free to try and if you can’t get budget for a proper plan send me an email and I’ll do what I can.
In general the only way to never have flaky tests is to have simpler tests but I find those often don’t provide as much value - that’s just my personal belief after having spent years focused on automated tests, e2e tests do have robustness issues but the bugs they find make them totally worth it. Out of the issues mentioned in the article that affected my tests the most, it’s timing. They can be overcome though, I’ve run test suites with a couple of thousand e2e tests (browser) that have been highly robust and reliable after time was devoted to hardening them. You do have to focus on that and refuse to add new test cases until the existing ones are sorted out in some cases.