by xjia
928 days ago
I had a similar experience with ARC (actions-runner-controller). One of the machines in the fleet failed to sync its clock via NTP. Once a job X got scheduled to it, the runner pod failed authentication because of the wrong clock time, and then the whole ARC system started to misbehave: job X was stuck without runners until another workflow job Y was created, at which point X ran but Y became stuck. There were other weird behaviors like this, so I eventually rebuilt everything on VMs and stopped using ARC. Using VMs also let me support the official runner images [0], which is good for compatibility. I feel more people would benefit from managed "self-hosted" runners, so I started DimeRun [1] to provide cheaper GHA runners for people who don't have the time or willingness to troubleshoot low-level infra issues.

[0]: https://github.com/actions/runner-images
[1]: https://dime.run
If something fails and you don't have idle runners on standby (which itself means paying for unused resources), things start to snowball.