by notacoward
2576 days ago
I deal with this issue a lot in my current job, and did in my last job too. In my experience, timing issues are by far the most common culprit. Usually it's because a test has to guess how long a background repair or garbage-collection activity will take, when in fact that duration can be highly variable. Shorter timeouts make tests unreliable. Longer timeouts make them more reliable, but then some tests take forever. Speeding up the background processes can create CPU contention if tests are run in parallel, making other tests seem flaky. Various kinds of race conditions in tests are also a problem, but not one I personally encounter that often. That probably has to do with the type of software I work on (storage) and the type of developers I consequently work with. No matter what, developers complain and try to avoid running the tests at all. I'd love to force their hand by making a successful test run an absolute requirement for committing code, but the tests have been slow and flaky since long before I got here, so that would bring development to a standstill for weeks, and I lack the authority (real or moral) for something that drastic. Failing that, I lean toward re-running tests a few times when they are merely flaky (especially because of timing issues), and quarantining those that are fully broken. Then there's still the challenge of getting people to fix their broken tests, but life is full of tradeoffs like that.