by jgraham 3925 days ago

But we (Mozilla) are also doing many of the same things as the Chromium team here. In particular there is work in progress to automatically "ignore" the results of known-flaky tests until we detect that there has been a change in the rate of flakiness, at which point we will — assuming all goes to plan — trigger new test runs until we can determine the point at which the regression was introduced.
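As a rough illustration of what "detecting a change in the rate of flakiness" could look like (a sketch only, not Mozilla's or Chromium's actual implementation), one option is a one-sided binomial test: given a known baseline failure rate, ask how surprising the recent failure count is. The names `binomial_tail` and `flakiness_increased`, the `alpha` threshold, and the assumption that runs are independent are all hypothetical choices made for this example.

```python
from math import comb


def binomial_tail(n: int, k: int, p: float) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p).

    Exact summation; intended for modest n (a few hundred runs),
    since very large n can overflow when mixing big ints and floats.
    """
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def flakiness_increased(
    baseline_rate: float,
    recent_runs: int,
    recent_failures: int,
    alpha: float = 0.01,  # hypothetical significance threshold
) -> bool:
    """Flag a test whose recent failures are unlikely under the baseline rate.

    Assumes each run fails independently with probability baseline_rate.
    Only detects increases in flakiness, not decreases.
    """
    return binomial_tail(recent_runs, recent_failures, baseline_rate) < alpha


if __name__ == "__main__":
    # A test that normally fails 1% of the time, observed over 200 runs.
    print(flakiness_increased(0.01, 200, 2))   # about what the baseline predicts
    print(flakiness_increased(0.01, 200, 15))  # far more failures than expected
```

A test that trips this check would be the candidate for the extra retriggered runs used to narrow down where the regression landed.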

I think one of the lessons we've learnt is that with a browser-type project it's very hard to make test runs fully deterministic, for both technical and human reasons.

The technical reasons are touched on in the original article: these are complex codebases with lots of moving parts and lots of environmental dependencies. Of course there are various tactics to try and combat this; for example there is a wiki page dedicated to innocuous-looking code that leads to intermittent tests [1].

The human reasons centre around the difficulty of getting people to care about spending time fixing a test that fails one time in 1,000 (which is still very noticeable when you are running it hundreds of times a day). Unless the issue is something that fits a known pattern, it's hard work, difficult to tell if your fix even worked, and not likely to be considered a top priority due to the diffuse, hard-to-quantify nature of the benefits.

I think the fact that both Google and Mozilla still have significant problems with intermittents despite talented engineering staff and it having been a known problem for years implies that some of the standard thinking about making tests fully deterministic simply doesn't apply; for this kind of work you have to embrace — or at least accept — the randomness, and look for ways to get the data you need despite the noise.

[1] https://developer.mozilla.org/en-US/docs/Mozilla/QA/Avoiding...

cpeterso 3925 days ago

That's a good point that making tests fully deterministic is not actually possible. But users aren't deterministic either, so we must accept noisy test environments because that's what users see. Tracking changes in the rate of flakiness is an interesting idea.

Could you identify "all" the flaky tests by running all the tests 100 times on the same stable build (like the latest ESR)? Is it even possible to write a test that could pass 100 times in a row? :)

jgraham 3925 days ago

Running a test N times will certainly detect some fraction of all the flaky tests. It's something we occasionally do manually to work out if e.g. a certain intermittent is (likely) fixed and it's something that we'd like to do more to quarantine new tests.
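To put rough numbers on "some fraction": if a test fails independently with probability p on each run (an assumption that real intermittents often violate), the chance of seeing at least one failure in N runs is 1 - (1 - p)^N. The sketch below uses hypothetical helper names and applies that formula to the 100-run and 1-in-1,000 figures mentioned earlier in the thread.

```python
import math


def detection_probability(failure_rate: float, runs: int) -> float:
    """Chance of at least one failure in `runs` independent runs."""
    if not 0.0 < failure_rate < 1.0:
        raise ValueError("failure_rate must be strictly between 0 and 1")
    return 1.0 - (1.0 - failure_rate) ** runs


def runs_needed(failure_rate: float, confidence: float = 0.95) -> int:
    """Smallest number of runs that shows a failure with the given confidence."""
    if not 0.0 < failure_rate < 1.0:
        raise ValueError("failure_rate must be strictly between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - failure_rate))


if __name__ == "__main__":
    # A test that fails 1 time in 1,000, run 100 times.
    print(f"{detection_probability(0.001, 100):.3f}")
    # Runs required to catch that same test with 95% confidence.
    print(runs_needed(0.001, 0.95))
```

Under these assumptions, 100 runs would catch a 1-in-1,000 flake a bit under 10% of the time, and it takes roughly 3,000 clean runs before you can be 95% confident such a flake is absent or fixed.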

Unfortunately there are various confounding factors that mean many tests that would look clean in such a run might nevertheless be problematic. For example, if you only run tests that you think are intermittent, problems that are triggered by state left from a previous test won't be found. This is one reason that we've been trying to run particularly problematic test types (e.g. Firefox browser-chrome tests) in smaller groups, restarting the browser with a clean profile between groups to clear the state. A group size of 1 would obviously be ideal here, but when you have thousands of tests and limited resources it's not practical.
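The grouping idea can be sketched as follows. This is an illustration only: `run_group` is a hypothetical callback standing in for "start a browser with a clean profile, run these tests, shut the browser down", and none of the names correspond to a real Mozilla harness API.

```python
from typing import Callable, Dict, Iterator, List, Sequence


def chunked(tests: Sequence[str], group_size: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `group_size` tests."""
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    for start in range(0, len(tests), group_size):
        yield list(tests[start:start + group_size])


def run_in_groups(
    tests: Sequence[str],
    group_size: int,
    run_group: Callable[[List[str]], Dict[str, bool]],
) -> Dict[str, bool]:
    """Run tests group by group, collecting pass/fail results per test.

    `run_group` (hypothetical) is expected to launch a fresh browser with a
    clean profile for each call, so state can only leak within a group.
    """
    results: Dict[str, bool] = {}
    for group in chunked(tests, group_size):
        results.update(run_group(group))
    return results
```

The trade-off is visible in `group_size`: a value of 1 isolates every test but pays the browser start-up cost once per test, while larger groups are cheaper and let more state leak between tests.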

The other problem is tests that have unexpected sensitivity to the environment. For example, the other day DNS was being slow on the test infrastructure. This isn't a problem for most tests since they use something like /etc/hosts. But some tests were intentionally trying to use a non-resolving domain, and those tests suddenly started to randomly time out.