Hacker News
by adamb 3636 days ago
In my (limited) experience, treating your CI pipeline like a distributed system is a design smell. It leads to build processes that are difficult to test, fix, and iterate on.

When a build system can only be effectively invoked by CI/CD, it starts to pervert developer incentives. People need to check things in before they can be sure they work. They don't bother with tiny fixes because of the inertia. Flaky jobs get a quick rebuild, because reproducing a build failure locally is complex enough that they'd prefer to avoid it if they can.

Over time, these add up to a system that grows through accretion, which is the enemy of both agility and understandability.

Better is a build process that uses simple, reusable components that work equally well on developers' machines. These tools can be tested, refined, and replaced incrementally, using the same build processes that the rest of your code base does. You can do this without needing to couple your build processes to the specific way(s) that Jenkins (and company) model builds or their configuration.

3 comments

Here's a problem statement for you. You have ~12k tests that take > 40 hours to run sequentially. What do you do?

I know how we've solved the problem to provide as much validation as possible before shipping something to production and at pretty high rates of code churn. Whereas what you're suggesting is untenable on a large enough project. That's like saying drink your milk and have a hearty breakfast. Nice platitudes but not actual engineering. Our solution is not unique, in fact. Shopify and other big shops follow the exact same practices (https://www.youtube.com/watch?v=zWR477ypEsc). Not because they don't know any better and haven't heard of setting up proper build pipelines using principles from immutable infrastructure, but because at large enough scale you need mutability.

Jenkins was just an example. We don't use Jenkins but you do need something that manages workers and their lifecycle. Saying reduce your test runtime to 5 minutes and have better engineers and tools doesn't cut it.

Good discussion guys. Please keep going.

Isn't the architecture of your build directly related to both the architecture of your system and your deployment?

If so, why would somebody think that a monolithic app, even one with threading and workers built in, would be better than simply engineering your own as you go along? After all, this is supposed to be engineering, right? Not "How to use Jenkins"

I agree that platitudes aren't solutions, but code smells are the kind of thing that lead one to actually take ownership instead of perhaps using the same paradigm only larger, yes?

Apologies if I missed the point, dkarapetyan.

Code smell is a little ill-defined. Given two experienced enough engineers, they'll smell different things based on what experiences have led them to that point. The general rough guideline is, I guess, "things should be as simple as possible but not simpler", and depending on what set of requirements you've optimized for, it might not smell right to someone who values a different set of requirements.

We suffered with a Jenkins-like solution for a long time before we decided enough was enough and we wanted to use an approach that didn't need as much soul-crushing, CI-specific effort.

If any of our experiences or insights can help others in their own environments, all the better!

I don't know how large a large project is, but our system is pretty large. We build and test for 4 different operating system flavors, and way more than that if you count specific versions and distributions. We run end-to-end user tests against our applications that test functionality across many of these operating systems. We have broken up our tests into functional groups that have parallelism and caching within the groups, and the groups themselves run in parallel. In some cases a single developer or build slave has used 40 machines at once to run these tests (this number was only limited by our budget... Windows machines are extra expensive on EC2).
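To make the "split tests into groups that run in parallel" idea concrete, here is a minimal sketch. It is not the poster's tooling; `shard_tests`, `makespan`, and the per-test durations are hypothetical. It assumes you have a rough runtime estimate per test and uses the greedy longest-first heuristic to balance the groups.

```python
import heapq


def shard_tests(durations, num_workers):
    """Assign tests to workers, longest test first, always to the least-loaded worker.

    `durations` maps a (hypothetical) test name to its estimated runtime in seconds.
    """
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    # Min-heap of (total_seconds_assigned, worker_index).
    heap = [(0.0, i) for i in range(num_workers)]
    heapq.heapify(heap)
    shards = [[] for _ in range(num_workers)]
    # Sort by descending duration; break ties by name so the result is deterministic.
    for name, seconds in sorted(durations.items(), key=lambda kv: (-kv[1], kv[0])):
        load, idx = heapq.heappop(heap)
        shards[idx].append(name)
        heapq.heappush(heap, (load + seconds, idx))
    return shards


def makespan(shards, durations):
    """Wall-clock time of the slowest shard, assuming shards run fully in parallel."""
    return max((sum(durations[t] for t in shard) for shard in shards), default=0.0)
```

The heuristic does not guarantee an optimal split, but it is cheap and tends to keep the slowest shard close to the average, which is what determines total wall-clock time.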

In terms of reporting on tests that run in parallel, we built a tool that specializes in exactly that. It collates output from parallel tests, it times out on tests that are hung, it makes sure the build system doesn't kill it if tests are too silent. It also tracks which tests have run against which versions of the codebase in the past and what their outcomes are. We use supporting tools to analyze test flakiness and understand when they are introduced. We have had a lot of success with this approach, as developers debugging weirdness across many tests is less miserable when they can use the same tools that CI does.
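For readers who want a picture of what "collates output from parallel tests and times out on hung ones" can look like, here is a small sketch using only the Python standard library. It is an illustration under assumptions, not the tool described above: `run_one`, `run_all`, and the test names are made up, and each test is assumed to be a standalone command whose exit code signals pass or fail.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_one(name, argv, timeout_s):
    """Run one test command and return (name, status, combined output)."""
    try:
        proc = subprocess.run(
            argv,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # interleave stderr with stdout per test
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired as exc:
        # subprocess.run kills the child before raising; keep any partial output.
        out = exc.output or ""
        if isinstance(out, bytes):
            out = out.decode(errors="replace")
        return name, "timeout", out
    status = "pass" if proc.returncode == 0 else "fail"
    return name, status, proc.stdout


def run_all(tests, max_workers=4, timeout_s=60.0):
    """Run every test in parallel; return {name: (status, output)}.

    `tests` maps a test name to its argv list. Output is captured per test,
    so results never interleave on the console.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(run_one, n, argv, timeout_s) for n, argv in tests.items()]
        results = [f.result() for f in futures]
    return {name: (status, output) for name, status, output in results}
```

Because this is a plain function rather than CI configuration, a developer can call `run_all` on a laptop with the same arguments the CI job would use.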

Critically, when bugs in those tools are discovered, developers can pinpoint and fix those bugs locally with reasonable ease. Deploying fixes to the test runner (or the logic that allocates workers for the test runner) is like any other change. No need to tinker with Jenkins (or buildbot, etc) config. No need to take the build system down to test that the change is correct. No need to bring up a test version of the build system and experiment with your change there.

We've gone to great lengths to make our system something that's a joy to work with and helped us be very productive across the many different environments we need to operate in.

It's tough to know how much detail is appropriate in comment threads like these. You're absolutely right that there's a lot that needs to come together to make something like what I've described work. I know because we pulled enough of it together to support our own large and heterogeneous projects.

It sounds like you have also thought about this problem a lot. Can you share more about the sorts of tests (language, test library, etc) you have? Perhaps we can break new ground where each of our respective experiences and intuition intersect.

We have a similar design but need to juggle js and python in some interesting ways. There are only so many variations on the theme of CI so I'm not surprised about the convergent design. Our environment is not as heterogeneous and we leverage pre-baked AMIs and LXC containers for isolation and reproducibility.

My contention was the emphasis on local reproducibility. In the past I would have said yes, local reproducibility should be a feature of any well designed CI pipeline but nowadays I'm not sure anymore.

Local development environments are optimized for iteration speed at the cost of reproducibility and stability. Whether this is the right decision or not can be debated. The CI environment, on the other hand, is designed for reproducibility and stability. Those sets of requirements are somewhat at odds and you can't optimize for all of them at the same time. Tools should be shared across local and CI environments as much as possible, but not when it comes at the cost of compromising the requirements for each environment.

I disagree. Not that I don't think that simple, reusable components are valuable; they are, and any developer should run tests before sending things off to the cloud. But having things work on developers' machines is itself a code smell, because developers have access to them.

Your whole project should build straight from very fresh boxes, and developers' machines will never be fresh boxes. It's hard to remove the cult knowledge from a development team. My project just discovered a new unlisted dependency when doing a deploy, because every developer knew about it and installed it on their machine beforehand. It was explicitly listed in the build/devtools dependencies, but not supposed to be a runtime dep. Had the developers run tests on a fresh machine, they'd have run into it. (Of course, the CI team had also installed it on the CI boxes, because they used it for debugging.)

For a large class of tests, having the developers run them in the build environment is fine. But you also need to run them in the deploy environment, and to that end developers should hit the CI system. Every test that the CI system runs on each commit should be runnable before the commit. You should have tooling and spare capacity such that the CI system is used to run tests immediately before commit, not right after; that's too late. You should run them whenever a dev sends off for code review. You should run them whenever a dev feels like it; if they're in a good spot, run the unit tests, run the CI tests, see what's broken.

Your comments are fantastically correct. Yes, secret dependencies are the worst. We use an OS sandbox to prevent access to undeclared dependencies outside it. That same sandbox prevents access to the network unless a build or test target explicitly declares the need to use networking (e.g. for running tests against network services running on localhost).
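As a toy illustration of the "only declared dependencies are visible" idea, here is a sketch that runs a command in a scratch directory containing nothing but the declared input files. This is a much weaker stand-in for a real OS sandbox and is not what the poster describes using: absolute paths, installed system packages, and the network all remain reachable. The function name `run_with_declared_inputs` is hypothetical.

```python
import shutil
import subprocess
import sys
import tempfile
from pathlib import Path


def run_with_declared_inputs(src_root, declared, argv):
    """Run `argv` in a scratch dir holding only the files listed in `declared`.

    `declared` is a list of paths relative to `src_root`. A command that reads
    an undeclared relative path fails here even if the file exists in the
    developer's checkout.
    """
    src_root = Path(src_root)
    with tempfile.TemporaryDirectory() as scratch:
        for rel in declared:
            dest = Path(scratch) / rel
            dest.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src_root / rel, dest)
        return subprocess.run(
            argv,
            cwd=scratch,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
        )
```

Even this crude version surfaces the "everyone had it installed already" class of bug for files inside the repository: the build breaks the first time a step reads something it never declared.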

CI runs the exact same build system (though with a few different options so the outputs are easier to inspect during and after the build).

Passing CI is compulsory, as humans aren't allowed to release changes on our team. Humans may only do code review. If and when a change passes code review, it will be deployed automatically once it passes CI.

We use some of the same compute capacity that our CI system uses to scale test runners across many physical machines (though tests run against a pool of freshly cloned VMs using delta disks so we get a pretty big speedup and lots of control over the environment that tests run in).

There's a fascinating correlation between developer machines and build slaves. It's been my experience that needing to install system software of any kind on one usually leads to a headache later. We've gotten it down to just Xcode on OS X and almost just build-essential on Ubuntu.

So in spirit we do exactly what you're saying, we've just found a way to do it while using the same tooling on both CI and developer machines. We also demand that the build slave images are generated straight from install media and a fixed set of files (like those that install Xcode), so the only simple way to add dependencies (i.e. build tools or libraries) is via our build system. Use of apt, homebrew, etc is completely separate for our developers. And if they mess with the build system in a way that allows those files to leak in, the fact that build slaves are pristine means that their change will fail CI and never be deployed.

Does my explanation make sense? Happy to answer follow-up questions. Also happy to be shown where our rigor is lacking :)

I forgot to mention that using the same build tool for both CI and developers has a number of other advantages, including artifact caching. When a developer downloads a change, their build will pull artifacts from the caches that build slaves have populated. So in many cases new changes (once deployed) are only built once, across the whole company by the build slave that kicks off the deploy. Everyone just reuses that cached output.

It was this sharing of artifacts that provided some of the impetus to use a sandbox, since a polluted output could poison the cache in hard to detect ways.
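The artifact sharing described above can be sketched as a cache keyed by a hash of the build command and the contents of its declared inputs. This is a guess at the general shape, not the poster's system: `cache_key` and `ArtifactCache` are hypothetical names, and the in-memory dict stands in for whatever shared store a real setup would use.

```python
import hashlib


def cache_key(command, inputs):
    """Hash the command plus the name and contents of every declared input.

    `inputs` maps a file name to its bytes. Anything the build reads that is
    not in `inputs` is invisible to the key, which is how a polluted build
    can poison the cache.
    """
    h = hashlib.sha256()
    h.update(command.encode())
    h.update(b"\0")
    for name in sorted(inputs):
        h.update(name.encode())
        h.update(b"\0")
        h.update(hashlib.sha256(inputs[name]).digest())
    return h.hexdigest()


class ArtifactCache:
    """In-memory stand-in for a cache shared by CI workers and developers."""

    def __init__(self):
        self._store = {}
        self.builds = 0  # how many times we actually had to build

    def get_or_build(self, command, inputs, build):
        key = cache_key(command, inputs)
        if key not in self._store:
            self._store[key] = build(inputs)
            self.builds += 1
        return self._store[key]
```

With a shared store, the first machine to build a given key (typically a CI worker) pays the cost and everyone else gets a hit, as long as the key really covers everything the build depends on.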

We decided to think of our CI system as a distributed job execution engine for one project… now that we know better, we have a maze of twisty job configs to unwind.