Hacker News new | ask | show | jobs
by jrockway 2396 days ago
We didn't have a merge queue at Google. You rebased if there was a merge conflict, ran through CI again, and hoped there wasn't another merge conflict. I think I ran into merge conflicts maybe once a year, if that.

I think the success of this system breaks down into several parts:

1) Yup, microservices. You could submit your proto change, which would affect all clients, before actually implementing the code that used the new feature. (Or after, in the case of renaming some field from foo to deprecated_foo and refactoring the clients to stop using that field.) That means you could wrangle that change without having to worry about it affecting your actual feature. (Typically proto changes did not cause any breakages since people were very conservative about what changes they would make. Nobody renames all the fields, invalidating dependent code, or renumbers the fields, invalidating all existing messages. You COULD do those things, but nobody ever did.)

2) Clear dependencies in the build system. The CI system only had to run a small set of tests for most changes, because it knew exactly what tests the change would affect. You had to go way out of your way to depend on code without informing the build system. This is very different from every CI system that I've seen outside of Google, which seem to default to running everything and hoping your programming language or build system magically tracks dependencies. It doesn't; Docker for example will happily use random images that it thinks haven't changed, without actually checking if it has changed. (Consider building your app on top of golang:latest. Go is updated, and docker may or may not pull that new base image. Meanwhile, docker will happily clear its build cache if you edit README.md and no code. The result is that 50% of the time you waste 10 minutes rebuilding stuff that didn't change, and 50% of the time you get an outdated build. And nobody seems to care at all!)

3) Being careful about keeping changes small. I don't know what the average CL size is, but I would aim for 100 lines changed rather than 1000 lines changed. This is something that surprised me post-Google, people go away and work for a week and you have a 2000 line PR to review. These are tough to merge and were relatively rare in my experience at Google. It is not always possible to make every change small, but that should be the norm. Figure out how much work you can do in a day, and try to make a CL/PR that is that size. A lot can churn in a week. A lot less churns in a day. If you respected steps 1 and 2, that means your tests will run fast and it's unlikely that your merge will fail between CI and actually merging. If you have 2000 lines of code across 8 services... you'll probably never get it merged. But I am sure that I have successfully merged ginormous changes before, it's just more work.

All in all, my takeaway from this article is that Shopify is huge but I'm surprised that specialized merge tooling was necessary. I wonder what the underlying problem is; do they really have a 1000 developer monolith? Do they not use a proper build system like Bazel?

2 comments

Xoogler here. When I left in 2015 there were definitely teams that used merge queues (i.e. TAP presubmit). Generally these were teams with a more monolithic architecture, like YouTube that had a massive Python mono.
I guess TAP presubmit might be a merge queue... but it seems different from this. There was no requirement that some mechanical system checked that tests passed before your merged your CL. You could merge any code whenever it was approved. If you felt like running the tests, good for you. TAP presubmit is just that mechanical system that runs your tests before executing the merge. That seems like traditional CI to me, not a merge queue.

Jenkins with a Github plugin behaves almost exactly like this system. Every PR basically has tests run 3 times; once for the branch that the PR is on, once for your branch merged to master, and then once after you do the merge and submit it. TAP presubmit did the "once for your branch" and TAP did the post-merge CI.

TAP presubmit didn't really check that the resulting merge was sound, so you would see TAP presubmit pass, your change get merged, and then have the build break anyway because of the race condition. A merge queue would not have this race condition... so I'm not sure Shopify has one either. The more I think about it the more it sounds like they just rewrote Jenkins. (And for that, I don't blame them.)

I have never seen anyone automate what Bors does with Jenkins and have anything approaching decent UX. The closest I've ever seen is a permanent stage branch that sometimes has automatic promotion, little integration with reviews and inevitably breaks every few weeks until some poor soul debugs it.
Point 2 is very important and very hard to get right. For unit tests, there is a clear dependency on the code and you can easily just run a subset of the tests. But wouldn't you have to run any system and integration tests of the affected module, as it's not clear what effects the code change can have? This will blow up CI times again. How did Google deal with this?
Not sure if this actually answers the question, but - Bazel, the build system used at Google, creates dependency graphs (example: https://blog.bazel.build/2015/06/17/visualize-your-build.htm...), which I believe can be used to run tests on any code affected by a change.
Your integration test needed the system that you were integrating with, so you'd have to declare that as a dependency.

My philosophy was to always have integration tests run in the normal CI system. This basically meant creating a test binary that happened to link in the systems you were integrating with, and run tests against that. This is easier when everything is written in the same programming language, and for the cases where it wasn't, I was usually happy with "fakes". (https://testing.googleblog.com/2013/06/testing-on-toilet-fak...)

Other teams really loved the sandbox environment with live instances of everything. They would have some machinery outside the standard CI system to inject their code into this sandbox and run some tests, as well as machinery for keeping their sandbox up to date with production. (And adding test data, etc., etc., which all becomes very complex very quickly.)

Both methodologies have their downsides and upsides.

I generally prefer simplicity and speed; people should be able to run the tests on their workstation 100% of the time without having to set up any external resources. If you have an integration test binary that is built from the build system, this is possible. The downside is that config changes in production can break your system; since you are starting up your own instance of some other team's server, they could theoretically make some config change that breaks your integration. Even if you include their configuration in your in-memory version of their service, there was no guarantee that what is running in production is actually checked in yet. (Debugging in production, emergency rollback to an older prebuilt binary, etc.) These were rare and never caused me problems, however, and not having machinery to maintain a shadow environment meant it was easier to work on the code.

Having a sandbox environment was good because you could "check" (not test) big changes before putting them into production. You could try out your flag flip, database migration, mapreduce, or just load up the website in your browser and send your coworkers a link without affecting production data. And you could test your actual production binary in production-like conditions; as long as you sync'd production changes to your sandbox, your automated test probably ran against something that was very much like production. This let you check for more subtle things like performance regressions before deploying. (I worked on a system to do just that.)

The main problem I had with this method was that it was maintenance-intensive (big teams that used this had entire teams just to maintain the sandbox, and that begat sub teams that maintained the sandbox maintenance) and slow. Building and running another test during CI was relatively fast, but starting up a job in production and scaling it up was significantly slower. This meant that you needed a parallel set of tools to run some subset of this environment locally, and it was always painful. Not having your tests in the standard system meant that downstream dependencies wouldn't see test failures in your system when you made a change, so the "buildcop" would have to detect and fix that.

I found this to be too much overhead, but it is probably necessary when you are developing, say, a mobile application. You will have to write some sort of software to make it possible to try your in-progress code on your personal phone. You will probably want to be able to share links with coworkers. I generally like to push changes to production multiple times a day, and make sure that clients can handle a newer server and still work correctly. This way, as soon as a build passes tests, you can start giving it, say 0.1% of production traffic and keep an eye on the error rates, and promote that to production as quickly as possible. The biggest problem I've run into with this strategy is that 0.1% of Google's traffic is way more than enough for a good canary, but at other places I've worked... 0.1% of traffic might be one request over several days. In that case, you have to have staging and manually bug people to try it out. Sometimes I wonder if that kind of software is worth writing at all, to be perfectly honest. If you get one request a day, maybe just make it open a support ticket, and hire 2 support engineers instead of one software engineer. But I digress ;)

Tangentially:

I've seen several blog posts from Google about using fakes and 'hermetic servers' for testing. We use GCP for our product, and unfortunately, Google doesn't seem to care much about making this easy. For example, I think I saw only one or two languages for which the Google Storage client libraries provided "fakes" of a Google Storage server. For PubSub (and maybe one or two other services?) there is the PubSub Emulator, which is unfortunately in Java and isn't supported by any of the CLI tools.

For all their love of fakes and hermetic servers, it would be awesome if they provided them for all the GCP services.

Wow, thanks for the detailed reply. You mentioned a couple of implementations that I hadn't thought about. But I guess the short version would be, as so often: testing systems is hard, and there's no one-fits-all solution.