| > We noticed a strong correlation between crazy utilization spikes and CI failure rates. This is interesting, and is something I've also suspected on many CI systems that offer free public runners (CircleCI, GitHub Actions, etc.). For seemingly no reason at all, tests were very flaky and unstable in CI, in ways that couldn't be reproduced on local machines. I tried everything from resource-limited containers to identically spec'd VMs, and was never able to reproduce certain failures. This made issues very hard to troubleshoot and fix. Of course, you might say that this unstable environment surfaced race conditions in our tests or product, and that's true, but it's incredibly frustrating to have random failures that are impossible to reproduce locally, and to be stuck in the long experiment, push, wait-for-CI development loop. I suspect this is caused by overprovisioning of the underlying hardware, where many VMs are competing for the same resources. This seems quite frequent on Azure (GH Actions). In the article's case they patched it by making their environment more stable, which isn't an option for us on public runners, but I'd caution them that they're only patching over the issue, not fixing the root cause. The flakiness still exists in their code; it just isn't visible when the system is not under stress, and it will surface again when you least want it to, possibly in production. |
That was one of the reasons we ended up setting up our own runners. I didn't mention it in the post, but we use spot VM instances.