by sascha_sl 1501 days ago
Self-hosted runners at scale are a nightmare. Random failures all around, exit code is whatever it wants to be... you'd think with the runner code in the open all that could be fixed, but no, it's a very confusingly written MS C# codebase that has clearly grown over the years as TFS Pipelines. Even the GitHub developers got confused when they tried to implement ephemeral runners and then figured out that the scheduler just pipelines new jobs to runners and, if you kill the runner anyway, drops them permanently (shown to the user as "waiting for runner" for 15-20 minutes, then cancelled). And the worst: the component failing most often is the scheduler, and that runs at GitHub. It feels like it was written by an intern. It doesn't even do FIFO. Old jobs generally wait _longer_. We also have stuck jobs all the time because the scheduler lost track of them. GitHub Premium Enterprise Support just keeps saying to contact them whenever it happens and that they're "working on it". My team has been working on making this entire crap work for almost an entire year on and off, and things only got marginally better. To the point that we'll be using the new hosted runners with fixed egress IP and custom sizes instead.

Good job, you filled a hole in your spec sheet with literal shit and then sold us the only alternative a year later when it is unfeasible to migrate away again ("but we have a large Microsoft contract" politics keep us from doing that anyway, but that aside).