Running a test N times will certainly detect some fraction of all the flaky tests. It's something we occasionally do manually to work out if, e.g., a certain intermittent is (likely) fixed, and it's something that we'd like to do more of in order to quarantine new tests.
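To put a rough number on "some fraction", here is a minimal sketch. It assumes each run of a test fails independently with a fixed probability p strictly between 0 and 1, which real intermittents often violate. The function names are made up for illustration.

```python
import math

def detection_probability(p: float, n: int) -> float:
    """Chance that at least one of n independent runs fails,
    given an assumed per-run failure probability p."""
    return 1.0 - (1.0 - p) ** n

def runs_needed(p: float, confidence: float) -> int:
    """Smallest n for which detection_probability(p, n) >= confidence.
    Assumes 0 < p < 1 and 0 < confidence < 1."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p))
```

Under those assumptions, a test that fails 1% of the time needs on the order of 300 clean runs before you can be 95% confident it isn't flaky, which is why N has to be large for rare intermittents.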

Unfortunately, there are various confounding factors that mean many tests that look clean in such a run might nevertheless be problematic. For example, if you only run the tests that you think are intermittent, problems that are triggered by state left over from a previous test won't be found. This is one reason that we've been trying to run particularly problematic test types (e.g. Firefox browser-chrome tests) in smaller groups, restarting the browser with a clean profile between groups to clear the state. A group size of 1 would obviously be ideal here, but when you have thousands of tests and limited resources it's not practical.
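The grouping idea can be sketched roughly as follows. This is not the actual harness code: `run_group` is a hypothetical callback standing in for "start the browser with a clean profile, run these tests in order, return the names of the ones that failed".

```python
from typing import Callable, Iterator, List, Sequence

def chunked(tests: Sequence[str], group_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive groups of at most group_size tests, preserving order."""
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    for start in range(0, len(tests), group_size):
        yield tests[start:start + group_size]

def run_in_groups(
    tests: Sequence[str],
    group_size: int,
    run_group: Callable[[Sequence[str]], List[str]],  # hypothetical runner
) -> List[str]:
    """Run each group via run_group (assumed to start from clean state)
    and collect the names of failing tests."""
    failures: List[str] = []
    for group in chunked(tests, group_size):
        failures.extend(run_group(group))
    return failures
```

The trade-off is visible in the loop: the number of restarts is the number of groups, so a group size of 1 means one restart per test, while larger groups are cheaper but let more state leak between tests.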

The other problem is tests that have unexpected sensitivity to the environment. For example, the other day DNS was being slow on the test infrastructure. This isn't a problem for most tests, since they use something like /etc/hosts. But some tests were intentionally trying to use a non-resolving domain, and those tests suddenly started to time out at random.