by imiric 1221 days ago
> We noticed a strong correlation between crazy utilization spikes and CI failure rates.

This is interesting, and is something I've also suspected on many CI systems that offer free public runners (CircleCI, GitHub Actions, etc.).

For seemingly no reason at all, tests were very flaky in CI, and the failures couldn't be reproduced on local machines. I tried everything from resource-limited containers to identically spec'd VMs, and was never able to reproduce certain failures. This made issues very hard to troubleshoot and fix.

Of course, you might say that this unstable environment surfaced race conditions in our tests or product, and that's true, but it's incredibly frustrating to have random failures that are impossible to reproduce locally, and to be stuck in a slow experiment, push, wait-for-CI development loop.

I suspect this is caused by overprovisioning of the underlying hardware, where many VMs are competing for the same resources. This seems quite frequent on Azure (GH Actions).

In the article's case they addressed it by making their environment more stable, which isn't an option on public runners, but I'd caution them that they're only masking the issue, not fixing the root cause. The flakiness still exists in their code. It's just not visible when the system is not under stress, and it will surface again when you least want it to, possibly in production.
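
For anyone who wants to chase this kind of flakiness locally, here is a rough sketch of one way to try, not something from the article. It reruns a test callable many times while background threads burn CPU, then reports the fraction of runs that failed. The names `failure_rate_under_load` and `_burn_cpu` are made up for illustration, and in CPython the busy threads mostly contend for the GIL, so this is only a crude stand-in for noisy neighbours on a shared runner and may well not reproduce a given failure.

```python
import threading
from typing import Callable


def _burn_cpu(stop: threading.Event) -> None:
    # Hypothetical helper: busy loop that competes with the test for CPU time.
    while not stop.is_set():
        sum(i * i for i in range(1000))


def failure_rate_under_load(
    test: Callable[[], None], runs: int = 50, load_threads: int = 4
) -> float:
    """Run `test` `runs` times under artificial load; return the failure fraction.

    A run counts as failed only if `test` raises AssertionError.
    """
    stop = threading.Event()
    workers = [
        threading.Thread(target=_burn_cpu, args=(stop,), daemon=True)
        for _ in range(load_threads)
    ]
    for worker in workers:
        worker.start()

    failures = 0
    try:
        for _ in range(runs):
            try:
                test()
            except AssertionError:
                failures += 1
    finally:
        # Always stop the load threads, even if the test raises something else.
        stop.set()
        for worker in workers:
            worker.join()

    return failures / runs
```

If the rate is clearly higher with `load_threads=8` than with `load_threads=0`, that hints the test is timing-sensitive rather than randomly broken.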

1 comment

Yep, default runners in most CI platforms share resources, so they are prone to producing flakiness (depending on your setup).

That was one of the reasons we ended up setting up our own runners. I didn't mention it in the post, but we use spot VM instances.