by kevindong 1726 days ago
At $PRIOR_JOB, it always felt like the full E2E tests approached useless, since for every bug successfully caught there were ~20 false positives. At which point, everyone (myself included) blamed the tests and just reran them until they passed. Every single failure would halt the pipeline for anywhere from 5 minutes (when rerunning the failed test showed it was just flaky) up to multiple hours, since everyone would rather try to diagnose/hotfix the issue than revert their code to unblock the pipeline.

With that being said, a full run of the E2E suite at $PRIOR_JOB took very, very low double digit minutes so it wasn't that expensive. Rerunning a handful of failed tests took single digit minutes so it wasn't too terrible.


Was in a similar situation, and the VP of engineering banned the practice of rerunning failed tests, so flaky tests caused everybody pain. In less than 8 weeks the false positive rate dropped by about 3 orders of magnitude. There's a strong tendency to treat tests as a hurdle to get over rather than as a first-class part of the development process.
I imagine this would just turn into everyone inserting 10 second pauses into the tests that fail. Which works, but now your suite's run time doubles. Actually turning nondeterministic tests into deterministic ones is... hard. Really hard in some cases. Many devs don't even understand how to get there, even after years of E2E experience.
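
For what it's worth, the usual alternative to a fixed 10 second pause is to poll for the condition the test is actually waiting on, with a timeout as the upper bound. A minimal sketch in Python, assuming the test can express "ready" as a callable; `wait_until` and its parameters are made-up names for illustration, not any particular framework's API. It doesn't make a test deterministic by itself, it just stops the test from paying the worst-case wait on every run:

```python
import time


def wait_until(condition, timeout=10.0, interval=0.05):
    """Poll `condition` until it returns a truthy value or `timeout` seconds pass.

    Hypothetical helper: returns the truthy value as soon as it appears,
    raises TimeoutError if it never does.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        # Short pause between checks instead of one long blind sleep.
        time.sleep(interval)
```

A test would call something like `wait_until(lambda: page_has_loaded())` (where `page_has_loaded` is whatever check the test needs) and move on the moment it returns, so the timeout is only ever paid on a genuine failure.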

One place I worked, the E2E suite took a full hour to run. Everyone reran the tests. Merges took a full day in many cases. Management tried to force people to fix broken tests. But they also required new tests on new features. So it was a constant treadmill. There was basically a full mutiny by the end and the company killed off their entire E2E suite.

If people just started throwing random sleeps into tests, I think management would shit a brick. Do people throw random sleeps into production code to fix bugs where you work as well?
Not GP, and fortunately not often, but I have seen that done to overcome race conditions. I pushed for it to be corrected by using a proper design. That was a stupidly hard fight, though.
My pet peeve is people sprinkling C's "volatile" keyword in places. Since doing so inhibits many optimizations, it changes the timing and can make race conditions appear to go away.
Yep. Lots of effective ways to paper over issues without actually resolving them, and often disguising them so that resolution becomes nearly impossible later.

Worst, things like the introduced sleeps in some of the systems look legit. There are reasonable times to introduce a timed delay into your program (3rd party APIs have a rate limit, 1 request per second or 10 per 30 seconds or whatever). Depending on how you introduce these extra sleeps, then, it's possible that they'll look like they satisfy a valid requirement, when the reality is that they exist to cover up the absence of things like proper use of locks/mutexes or other elements.
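
To make the contrast concrete, here is a rough sketch of the "proper use of locks" version of a shared counter in Python. The `Counter` class and `run_workers` function are invented for illustration and aren't taken from any system mentioned in this thread. The point is that the lock makes the read-modify-write step atomic, so the result doesn't depend on timing, whereas a sleep would only make the bad interleaving less likely:

```python
import threading


class Counter:
    """Shared counter whose increment is protected by a lock (illustrative only)."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def increment(self):
        # The lock makes the read-modify-write a single critical section.
        with self._lock:
            self._value += 1

    @property
    def value(self):
        with self._lock:
            return self._value


def run_workers(counter, workers=8, increments=10_000):
    """Hammer the counter from several threads and wait for all of them."""

    def work():
        for _ in range(increments):
            counter.increment()

    threads = [threading.Thread(target=work) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value
```

With the lock in place, `run_workers(Counter())` comes out to `workers * increments` every time, with no sleeps involved.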

At one place I consulted, the FTE lead ignored flaky tests and attributed failures to the tests being wrong.

A few months later...

The code that was failing intermittently was found to be using floating point types for money. Yeah, I'm gonna wanna fix that.
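
For anyone who hasn't been bitten by this, a small Python illustration of why floats and money don't mix. This is not the code from that codebase, just the textbook case, and `total_float` / `total_decimal` are names made up for the example. Whether a test passes or fails then depends on which amounts happen to flow through it, which is one way such failures can look intermittent:

```python
from decimal import Decimal


def total_float(prices):
    """Sum price strings using binary floating point (the problematic way)."""
    return sum(float(p) for p in prices)


def total_decimal(prices):
    """Sum price strings using exact decimal arithmetic."""
    return sum((Decimal(p) for p in prices), Decimal("0"))


prices = ["0.10", "0.20"]

# Binary floats cannot represent 0.10 or 0.20 exactly, so the sum is off.
print(total_float(prices) == 0.30)               # False
# Decimal keeps the values exact, so the comparison holds.
print(total_decimal(prices) == Decimal("0.30"))  # True
```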

Right, if you have flaky tests, there are 3 acceptable responses:

1. Fix the test

2. Fix the code that is being tested

3. Say "well, we don't need this software to be reliable anyways, so let's just stop running tests"

But many places seem to adopt hidden option #4: "Run the tests and ignore failures."

A related issue is dialing the tunables for warnings up to 11 and then not reading any of the warnings. Once I saw a case where the build generated 1000s of warnings. I found a bug and said, "this would be flagged as a warning even with relatively low warning settings." Sure enough, it was.

Obviously fixing warnings is good, but if they had just lowered the warning setting to be something reasonable, they would have had maybe 10 warnings total, one of which was a bug, which is a lot more useful than 1000s of warnings, at least one of which was a bug.

Option #4 is just option #3 but keeping the costs of running tests you ignore.

You're right about excessive warnings, but then sometimes not. Running `gcc -Wall` used to be considered madness, and if you did it now on a codebase that has been around a while and not kept clean, you'd drown in messages. The key is to turn it on from the very start and fix things when there are 10 warnings instead of 1000.

This decay happens with test suites, too. One or two tests start to fail, and instead of fixing them, people ignore the failures. A bit later, it's five tests, then 10, and pretty soon the programmers see the tests as broken instead of looking at the failures that let things get to the point where there are so many failing tests.

The fix for both situations is similar, though: dial down the {warning strictness|number of tests run} until you get a clean {warnings|test-run}, then re-enable them one by one in order of how easy they are to fix.
Obviously the E2E tests were really badly implemented. Implementing solid E2E tests is a skill that needs to be learned like any other software development skill. Most developers don’t know how to do it well.