by danpalmer 1221 days ago
The short answer is "do as little as possible". What this means in practice is breaking down every step of CI, figuring out the dependencies for that step, and then ordering the graph of dependencies such that you start as much as possible as early as possible. This process also usually shows you where things are slow and what the critical path is.
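
To make the "order the graph" idea concrete, here is a rough sketch in Python. The step names, durations, and the `schedule` helper are all made up for illustration; no CI service exposes this API. It just shows how earliest start times and the critical path fall out of a dependency graph.

```python
from graphlib import TopologicalSorter

# Hypothetical CI steps: name -> (duration in minutes, prerequisite steps).
steps = {
    "checkout": (1, set()),
    "install_deps": (4, {"checkout"}),
    "lint": (2, {"install_deps"}),
    "unit_tests": (6, {"install_deps"}),
    "build_image": (5, {"checkout"}),
    "integration_tests": (8, {"build_image", "install_deps"}),
}

def schedule(steps):
    """Return (start times, total wall-clock time, critical path),
    assuming unlimited parallel workers."""
    graph = {name: deps for name, (_, deps) in steps.items()}
    start, finish, blocked_by = {}, {}, {}
    for name in TopologicalSorter(graph).static_order():
        duration, deps = steps[name]
        # A step can start as soon as its slowest prerequisite finishes.
        blocker = max(deps, key=lambda d: finish[d], default=None)
        start[name] = finish[blocker] if blocker is not None else 0
        finish[name] = start[name] + duration
        blocked_by[name] = blocker
    # Walk back from the last step to finish to recover the critical path.
    node = max(finish, key=finish.get)
    total, path = finish[node], []
    while node is not None:
        path.append(node)
        node = blocked_by[node]
    return start, total, path[::-1]

start, total, path = schedule(steps)
print(total)  # 14
print(path)   # ['checkout', 'build_image', 'integration_tests']
```

With these invented numbers the pipeline takes 14 minutes instead of the 26 it would take run serially, and the critical path tells you that speeding up `lint` or `unit_tests` would not help at all.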

Unfortunately, doing this in most CI services is actually quite difficult. It usually means a complex graph of execution, complex cache usage, and being careful to not re-generate artifacts you don't need to.
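
As a sketch of the "don't re-generate artifacts you don't need to" part: one common approach (my assumption here, not something any particular CI service provides out of the box) is to key each artifact on a hash of the step name and its inputs. `cache_key` and `run_cached` are hypothetical helpers, and the dict stands in for whatever cache storage you have.

```python
import hashlib

def cache_key(step, inputs):
    """Derive a key from the step name and the content of its input files.

    `inputs` maps a file path to its bytes; paths are sorted so that
    iteration order does not change the key.
    """
    h = hashlib.sha256(step.encode())
    for path in sorted(inputs):
        h.update(b"\0" + path.encode() + b"\0")
        h.update(hashlib.sha256(inputs[path]).digest())
    return h.hexdigest()

def run_cached(step, inputs, build, cache):
    """Run `build` only if no artifact exists for these exact inputs."""
    key = cache_key(step, inputs)
    if key not in cache:
        cache[key] = build(inputs)
    return cache[key]
```

Calling `run_cached` twice with identical inputs builds once; changing a single byte of any input produces a new key and forces a rebuild.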

In my experience, building this, at a level of reliability necessary for a team of more than a few devs, is hard. Jenkins can do it reliably, but doing it fast is hard because the caching primitives are poor. Circle and GitLab can do it quickly, but the execution dependency primitives aren't great and the caches can be unreliable. Circle also has _terrible_ network speeds, so doing too much caching slows down builds. GitHub Actions is pretty good for all of this, but it's still a ton of work.

The best answer is to use a build system or CI system that is modelled in a better way. Things like Bazel essentially manage this graph for you in a very smart way, but they only really work when you have a CI system designed to run Bazel jobs, and there aren't many of these that I've seen. It's a huge paradigm shift, and requires quite a lot of dev work to make happen.


It's so surprising to me that this is such a poorly supported paradigm in commodity CI systems. Caching artifacts and identifying slow stages is like... super important for scaling CI for large enough orgs. We need better tools!
The more time you spend debugging this and the worse job you do at it, the more money they make.
While it sounds cynical... it doesn't strike me as entirely 'wrong'. It's a non-trivial problem, but until some service provides great tools to handle this, and makes the experience 10x better (to encourage more use/experimentation/etc), everyone will keep offering the same experience all around. If a service could automatically cut build times by, say, 70%, that's a lot of revenue they may lose from charging for the build time. They could raise the price, or hope that enough new people get on board to make up the loss...?
That feels counterintuitive to me. I would probably use even more CI minutes if they had higher value.
Maybe it's because CI service providers don't want to be responsible for a lot of cache storage.

Given how lacking this feature is, maybe a CI vendor could offer it as a premium paid feature.

CircleCI sort of do! They have something called Docker Layer Caching, which basically puts all the Docker layers from your previous build on the execution machine.

The problem is that it's a) very slow to download those layers from their cache storage, and b) very expensive. It works out to a cost of ~20 minutes of build time.

I have found GitLab and runners the best option here.
The problem I had with GitLab was that the mechanisms for controlling dependencies between stages were fairly basic. They only added them in ~2020 I think, and they weren't well documented.

Additionally, there are no cache guarantees between jobs within one execution. This means that you can't reliably cache an artifact in one job, and then share it with multiple downstream jobs. It mostly works, but it's hard to debug when it doesn't, especially if the cache artifact isn't versioned.
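
To illustrate what I mean by versioning: a minimal sketch, assuming you key the artifact on the commit and store a checksum next to it. `publish`, `fetch`, and `ArtifactMissing` are hypothetical names, and the dict stands in for the CI cache; this is not GitLab's API.

```python
import hashlib

class ArtifactMissing(Exception):
    """Raised when the expected versioned artifact is absent or corrupt."""

def publish(cache, name, commit, data):
    # Upstream job: store the artifact under a key that includes the commit.
    key = f"{name}@{commit}"
    cache[key] = (hashlib.sha256(data).hexdigest(), data)
    return key

def fetch(cache, name, commit):
    # Downstream job: fail loudly on a miss instead of silently
    # picking up whatever an earlier pipeline left behind.
    key = f"{name}@{commit}"
    if key not in cache:
        raise ArtifactMissing(f"{key} not found; cache holds {sorted(cache)}")
    digest, data = cache[key]
    if hashlib.sha256(data).hexdigest() != digest:
        raise ArtifactMissing(f"{key} failed its checksum")
    return data
```

The point is that a cache miss becomes an explicit, explainable failure (or a trigger to rebuild) rather than a downstream job quietly running against a stale artifact.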

GitLab is "fine", and has some nice usability features for basic pipelines, but it's definitely not doing anything better than the other major providers with respect to these problems.

Dependency controls have improved quite a bit. They went through a couple of variations of this and the current solution is nice.

I haven't run into an issue with artifact hand-off so far though. Maybe it's one of the rarer concerns, but it's not something I've experienced (fortunately). I imagine it would be a pain to debug though.