by latortuga 2396 days ago
Well, the problem is that master is a bottleneck. Trying to build CI on every branch before merging to master just won't work with the scale they are dealing with. At 1000 developers, the rate of PRs coming in makes it impossible to determine what current master will be when the PR is ready to merge (i.e. when the branch has a green build). It's also wasteful to build each branch against current master because what is "current" will not be when the branch is ready to merge.

Perhaps this problem is what microservices are meant to solve. When you can't coherently integrate code fast enough, attack the bottleneck (master) by splitting it (multiple services).

8 comments

Microservices don't really help with this. They just force you to think about your interfaces, but you should do that in a monolith too. If your interfaces are reasonably stable, merging is unlikely to break master if the branch was green before. If your interfaces change rapidly, you get problems with microservices too, just one level higher up, where you try to integrate them into a usable product.
One of the things I think microservices do help with is thinking about systems as composed of components that are developed at different velocities and with different tolerances for risk.

Imagine an e-commerce site broken into a bunch of services including search and checkout. The search team is making updates daily, trying to improve ranking and drive conversion. The checkout team (assuming that the site is mature and has hit some design equilibrium) may only be releasing changes every couple of months, and if a bug is introduced, the financial impact is a lot higher.

By not bundling the outputs of very different teams together, you can help those that want to "move fast and break things" with their "moving fast" goal, and de-risk breaking everything by reducing the surface area of changes. Microservices-based architectures are a way to reduce friction caused by the structure of your organization and are one outcome of an Inverse Conway Maneuver.

They do help if a single team of 5-7 developers owns a set of microservices; it's unlikely you will have tons of PRs to merge all at once in a single repository with a smaller team. Granted, the ownership is a bit more clear when talking about a self-contained system that a team owns: https://scs-architecture.org/vs-ms.html

In the SCS literature, you would integrate via async mechanisms across SCSes, provide versioned interfaces, and enforce via consumer-driven contract testing like Pact: https://pact.io
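
To make the consumer-driven contract idea concrete, here is a minimal hand-rolled sketch. It does not use Pact's API or file format; the contract, field names, and `violations` helper are all made up for illustration. The consumer writes down what it relies on, and the provider's build checks its responses against that.

```python
# Hypothetical contract: the fields (and types) a consumer relies on
# in the provider's response. This is not Pact's actual format.
CHECKOUT_CONTRACT = {"order_id": str, "total_cents": int}


def violations(contract, response):
    """Return a list of human-readable contract violations (empty if none)."""
    problems = []
    for field, expected_type in contract.items():
        if field not in response:
            problems.append(f"missing field: {field}")
        elif not isinstance(response[field], expected_type):
            problems.append(f"wrong type for {field}")
    return problems


# The provider may add fields freely; dropping a relied-on field fails.
ok_response = {"order_id": "A1", "total_cents": 1250, "currency": "USD"}
bad_response = {"order_id": "A1", "total": 12.5}

print(violations(CHECKOUT_CONTRACT, ok_response))
print(violations(CHECKOUT_CONTRACT, bad_response))
```

The point of running this in the provider's CI is that an interface break shows up in the repository that caused it, not later at integration time.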

> Perhaps this problem is what microservices are meant to solve.

Kinda. Microservices have always been an organizational solution; they're a way to shard your company's work output. Usually that's API contracts, but whatever mechanism is bottlenecked on the work output is affected, including how many concurrent builds are running due to how many people are touching the code at the same time.

This paper might be of interest to you on this very subject:

https://eng.uber.com/research/keeping-master-green-at-scale/

We didn't have a merge queue at Google. You rebased if there was a merge conflict, ran through CI again, and hoped there wasn't another merge conflict. I think I ran into merge conflicts maybe once a year, if that.

I think the success of this system breaks down into several parts:

1) Yup, microservices. You could submit your proto change, which would affect all clients, before actually implementing the code that used the new feature. (Or after, in the case of renaming some field from foo to deprecated_foo and refactoring the clients to stop using that field.) That means you could wrangle that change without having to worry about it affecting your actual feature. (Typically proto changes did not cause any breakages since people were very conservative about what changes they would make. Nobody renames all the fields, invalidating dependent code, or renumbers the fields, invalidating all existing messages. You COULD do those things, but nobody ever did.)

2) Clear dependencies in the build system. The CI system only had to run a small set of tests for most changes, because it knew exactly what tests the change would affect. You had to go way out of your way to depend on code without informing the build system. This is very different from every CI system that I've seen outside of Google, which seem to default to running everything and hoping your programming language or build system magically tracks dependencies. It doesn't; Docker for example will happily use random images that it thinks haven't changed, without actually checking if they have changed. (Consider building your app on top of golang:latest. Go is updated, and docker may or may not pull that new base image. Meanwhile, docker will happily clear its build cache if you edit README.md and no code. The result is that 50% of the time you waste 10 minutes rebuilding stuff that didn't change, and 50% of the time you get an outdated build. And nobody seems to care at all!)

3) Being careful about keeping changes small. I don't know what the average CL size is, but I would aim for 100 lines changed rather than 1000 lines changed. This is something that surprised me post-Google: people go away and work for a week and you have a 2000 line PR to review. These are tough to merge and were relatively rare in my experience at Google. It is not always possible to make every change small, but that should be the norm. Figure out how much work you can do in a day, and try to make a CL/PR that is that size. A lot can churn in a week. A lot less churns in a day. If you respected steps 1 and 2, that means your tests will run fast and it's unlikely that your merge will fail between CI and actually merging. If you have 2000 lines of code across 8 services... you'll probably never get it merged. But I am sure that I have successfully merged ginormous changes before; it's just more work.
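
To illustrate the rename-versus-renumber distinction from point 1, here is a toy model. It is not protobuf itself, just a sketch of a wire format that, like protobuf's binary encoding, identifies fields by number rather than by name. The schemas and the `encode`/`decode` helpers are hypothetical.

```python
# Toy wire format keyed by field number. Schemas map number -> field name.
SCHEMA_V1 = {1: "foo", 2: "bar"}
SCHEMA_RENAMED = {1: "deprecated_foo", 2: "bar"}  # same numbers, new name
SCHEMA_RENUMBERED = {2: "foo", 1: "bar"}          # same names, new numbers


def encode(schema, message):
    """Map field names to field numbers, as a stand-in for serialization."""
    number_of = {name: number for number, name in schema.items()}
    return {number_of[name]: value for name, value in message.items()}


def decode(schema, wire):
    """Map field numbers back to names using the reader's schema."""
    return {schema[number]: value for number, value in wire.items()}


stored = encode(SCHEMA_V1, {"foo": "hello", "bar": 42})

# Renaming keeps old data readable; only code that uses the name must change.
renamed = decode(SCHEMA_RENAMED, stored)
# Renumbering silently swaps the meaning of every stored message.
renumbered = decode(SCHEMA_RENUMBERED, stored)

print(renamed)
print(renumbered)
```

In this model a rename is a source-level refactor that can land as its own small change, while a renumbering corrupts data that already exists, which is why nobody did it.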

All in all, my takeaway from this article is that Shopify is huge but I'm surprised that specialized merge tooling was necessary. I wonder what the underlying problem is; do they really have a 1000 developer monolith? Do they not use a proper build system like Bazel?

Xoogler here. When I left in 2015 there were definitely teams that used merge queues (i.e. TAP presubmit). Generally these were teams with a more monolithic architecture, like YouTube that had a massive Python mono.
I guess TAP presubmit might be a merge queue... but it seems different from this. There was no requirement that some mechanical system checked that tests passed before you merged your CL. You could merge any code whenever it was approved. If you felt like running the tests, good for you. TAP presubmit is just that mechanical system that runs your tests before executing the merge. That seems like traditional CI to me, not a merge queue.

Jenkins with a Github plugin behaves almost exactly like this system. Every PR basically has tests run 3 times; once for the branch that the PR is on, once for your branch merged to master, and then once after you do the merge and submit it. TAP presubmit did the "once for your branch" and TAP did the post-merge CI.

TAP presubmit didn't really check that the resulting merge was sound, so you would see TAP presubmit pass, your change get merged, and then have the build break anyway because of the race condition. A merge queue would not have this race condition... so I'm not sure Shopify has one either. The more I think about it the more it sounds like they just rewrote Jenkins. (And for that, I don't blame them.)

I have never seen anyone automate what Bors does with Jenkins and have anything approaching decent UX. The closest I've ever seen is a permanent stage branch that sometimes has automatic promotion, little integration with reviews and inevitably breaks every few weeks until some poor soul debugs it.
Point 2 is very important and very hard to get right. For unit tests, there is a clear dependency on the code and you can easily just run a subset of the tests. But wouldn't you have to run any system and integration tests of the affected module, as it's not clear what effects the code change can have? This will blow up CI times again. How did Google deal with this?
Not sure if this actually answers the question, but - Bazel, the build system used at Google, creates dependency graphs (example: https://blog.bazel.build/2015/06/17/visualize-your-build.htm...), which I believe can be used to run tests on any code affected by a change.
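
As a rough sketch of how a dependency graph turns into "run only the affected tests" (this is not Bazel's implementation; the targets and the `affected_tests` helper are made up, though Bazel's query language does have an `rdeps` function for this kind of question):

```python
from collections import deque

# Hypothetical build graph: target -> targets it directly depends on.
DEPS = {
    "//search:lib": [],
    "//payments:lib": [],
    "//checkout:lib": ["//payments:lib"],
    "//search:test": ["//search:lib"],
    "//payments:test": ["//payments:lib"],
    "//checkout:test": ["//checkout:lib"],
}


def affected_tests(deps, changed):
    """Return test targets that transitively depend on any changed target."""
    # Invert the graph so we can walk from a target to its dependents.
    dependents = {target: [] for target in deps}
    for target, direct in deps.items():
        for dep in direct:
            dependents[dep].append(target)

    seen = set(changed)
    queue = deque(changed)
    while queue:
        for parent in dependents[queue.popleft()]:
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return sorted(t for t in seen if t.endswith(":test"))


print(affected_tests(DEPS, ["//payments:lib"]))
print(affected_tests(DEPS, ["//search:lib"]))
```

An integration test fits the same scheme as long as it declares the systems it links against as dependencies.
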
Your integration test needed the system that you were integrating with, so you'd have to declare that as a dependency.

My philosophy was to always have integration tests run in the normal CI system. This basically meant creating a test binary that happened to link in the systems you were integrating with, and run tests against that. This is easier when everything is written in the same programming language, and for the cases where it wasn't, I was usually happy with "fakes". (https://testing.googleblog.com/2013/06/testing-on-toilet-fak...)
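
For anyone who hasn't seen the pattern, a fake is a lightweight working implementation of a dependency's interface, small enough to link into the test binary. A minimal sketch, where the blob store interface and `archive_report` are invented for illustration:

```python
class FakeBlobStore:
    """In-memory stand-in for a hypothetical blob storage service."""

    def __init__(self):
        self._blobs = {}

    def put(self, key, data):
        self._blobs[key] = bytes(data)

    def get(self, key):
        if key not in self._blobs:
            raise KeyError(key)
        return self._blobs[key]


def archive_report(store, report_id, text):
    """Code under test: depends only on the put/get interface."""
    key = f"reports/{report_id}.txt"
    store.put(key, text.encode("utf-8"))
    return key


# The test runs entirely in-process, with no external service to set up.
store = FakeBlobStore()
key = archive_report(store, 7, "ok")
print(key, store.get(key))
```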

Other teams really loved the sandbox environment with live instances of everything. They would have some machinery outside the standard CI system to inject their code into this sandbox and run some tests, as well as machinery for keeping their sandbox up to date with production. (And adding test data, etc., etc., which all becomes very complex very quickly.)

Both methodologies have their downsides and upsides.

I generally prefer simplicity and speed; people should be able to run the tests on their workstation 100% of the time without having to set up any external resources. If you have an integration test binary that is built from the build system, this is possible. The downside is that config changes in production can break your system; since you are starting up your own instance of some other team's server, they could theoretically make some config change that breaks your integration. Even if you include their configuration in your in-memory version of their service, there was no guarantee that what is running in production is actually checked in yet. (Debugging in production, emergency rollback to an older prebuilt binary, etc.) These were rare and never caused me problems, however, and not having machinery to maintain a shadow environment meant it was easier to work on the code.

Having a sandbox environment was good because you could "check" (not test) big changes before putting them into production. You could try out your flag flip, database migration, mapreduce, or just load up the website in your browser and send your coworkers a link without affecting production data. And you could test your actual production binary in production-like conditions; as long as you sync'd production changes to your sandbox, your automated test probably ran against something that was very much like production. This let you check for more subtle things like performance regressions before deploying. (I worked on a system to do just that.)

The main problem I had with this method was that it was maintenance-intensive (big teams that used this had entire teams just to maintain the sandbox, and that begat sub teams that maintained the sandbox maintenance) and slow. Building and running another test during CI was relatively fast, but starting up a job in production and scaling it up was significantly slower. This meant that you needed a parallel set of tools to run some subset of this environment locally, and it was always painful. Not having your tests in the standard system meant that downstream dependencies wouldn't see test failures in your system when you made a change, so the "buildcop" would have to detect and fix that.

I found this to be too much overhead, but it is probably necessary when you are developing, say, a mobile application. You will have to write some sort of software to make it possible to try your in-progress code on your personal phone. You will probably want to be able to share links with coworkers. I generally like to push changes to production multiple times a day, and make sure that clients can handle a newer server and still work correctly. This way, as soon as a build passes tests, you can start giving it, say 0.1% of production traffic and keep an eye on the error rates, and promote that to production as quickly as possible. The biggest problem I've run into with this strategy is that 0.1% of Google's traffic is way more than enough for a good canary, but at other places I've worked... 0.1% of traffic might be one request over several days. In that case, you have to have staging and manually bug people to try it out. Sometimes I wonder if that kind of software is worth writing at all, to be perfectly honest. If you get one request a day, maybe just make it open a support ticket, and hire 2 support engineers instead of one software engineer. But I digress ;)
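
To put rough numbers on the 0.1% canary point, here is the arithmetic as a small sketch. The traffic figures are invented for illustration, not measurements from any real service.

```python
def days_until_canary_sample(daily_requests, canary_fraction, needed_requests):
    """Days of traffic before the canary has served `needed_requests`."""
    canary_per_day = daily_requests * canary_fraction
    return needed_requests / canary_per_day


# Hypothetical high-traffic service: a billion requests a day.
big = days_until_canary_sample(1_000_000_000, 0.001, 10_000)
# Hypothetical low-traffic service: 500 requests a day.
small = days_until_canary_sample(500, 0.001, 1)

print(f"big site: {big:.2f} days for 10,000 canary requests")
print(f"small site: {small:.1f} days for a single canary request")
```

Under these assumptions the big site has a usable sample within minutes, while the small one waits days for a single request.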

Tangentially:

I've seen several blog posts from Google about using fakes and 'hermetic servers' for testing. We use GCP for our product, and unfortunately, Google doesn't seem to care much about making this easy. For example, I think I saw only one or two languages for which the Google Storage client libraries provided "fakes" of a Google Storage server. For PubSub (and maybe one or two other services?) there is the PubSub Emulator, which is unfortunately in Java and isn't supported by any of the CLI tools.

For all their love of fakes and hermetic servers, it would be awesome if they provided them for all the GCP services.

Wow, thanks for the detailed reply. You mentioned a couple of implementations that I hadn't thought about. But I guess the short version would be, as so often: testing systems is hard, and there's no one-size-fits-all solution.
By virtue of having a queue of PRs that need to test & merge, you could pipeline this thing out pretty substantially.

The implication here being that a queue must be processed in-order, so you will ultimately have a perfect sequence of future commits to speculate against, and can incrementally build up each hypothetical future master state for a test build on one of any number of parallel build agents. As the queue depth grows, you would see higher and higher throughput.
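
A minimal sketch of that pipelining, assuming an in-order queue and a `passes(state)` callback standing in for a CI run (both hypothetical; a real system would run the builds in parallel rather than in a list comprehension):

```python
def speculative_states(master, queue):
    """Yield (pr, state): master plus every queued PR up to and including pr."""
    state = list(master)
    for pr in queue:
        state = state + [pr]
        yield pr, state


def process_queue(master, queue, passes):
    """Merge queued PRs in order, dropping any whose speculative build fails.

    A failure invalidates the speculative states built on top of it, so the
    PRs behind the failing one are re-speculated in the next round.
    """
    merged = list(master)
    pending = list(queue)
    while pending:
        results = [
            (pr, passes(state))
            for pr, state in speculative_states(merged, pending)
        ]
        for index, (pr, ok) in enumerate(results):
            if not ok:
                # Drop the failing PR and re-speculate everything after it.
                pending = pending[index + 1:]
                break
            merged.append(pr)
        else:
            pending = []
    return merged


# Hypothetical CI: any state containing PR "b" is red.
print(process_queue(["m"], ["a", "b", "c"], lambda state: "b" not in state))
```

When every build is green, the whole queue merges after a single round of parallel builds. The cost is the wasted builds behind each failure.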

> Trying to build CI on every branch before merging to master just won't work with the scale they are dealing with.

Google does it with 50 times the developer count.

> At 1000 developers, the rate of PRs coming in makes it impossible to determine what current master will be when the PR is ready to merge (i.e. when the branch has a green build).

True, it is impossible to catch all errors like this, but you can catch almost every error by building and testing it against current master and then merge it with the master 20 minutes later when the build is done. I have seen maybe one build breakage a year being introduced due to this in projects I've worked on, so it isn't a big deal.

For even better accuracy you can use a tool that will run tests against speculative merge states. Zuul[1] is an open source project that supports it out of the box.

[1] https://zuul-ci.org/docs/zuul/user/gating.html

> building and testing it against current master and then merge it with the master 20 minutes later when the build is done.

And I'm pretty sure that is the way Google does it too. Test a commit against current master; if tests are green, commit. Then run tests against master again (and I think this stage might not run for every single commit) to see if anything broke on the rare times there was an actual conflict. If that run was red, which should be rare, then you can have the system do a bisect to find the offending commit, or just run all the ones that haven't been individually tested.
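
The bisect step can be sketched as a binary search over the untested commits. This is only an illustration, not how any particular CI system does it; `is_green` is a hypothetical stand-in for a CI run.

```python
def first_bad_commit(commits, is_green):
    """Return the commit that turned the build red.

    Assumes master was green before `commits`, is red after all of them,
    and stays red once broken. `is_green(prefix)` stands in for a CI run
    on master with that prefix of commits applied.
    """
    low, high = 0, len(commits)  # green after `low` commits, red after `high`
    while high - low > 1:
        mid = (low + high) // 2
        if is_green(commits[:mid]):
            low = mid
        else:
            high = mid
    return commits[high - 1]


commits = ["c1", "c2", "c3", "c4", "c5", "c6", "c7", "c8"]
# Hypothetical CI: the build is red once c6 is included.
print(first_bad_commit(commits, lambda prefix: "c6" not in prefix))
```

For n untested commits this takes about log2(n) CI runs instead of n, which is why skipping the post-merge run on some commits can be affordable.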

You have no idea how Google solved it. Basically everyone with a monorepo (except Google) implements it as a cargo-cult best practice, mindlessly copying Google without understanding how Google actually does it.
Yet it seems like large companies mostly prefer monorepos, so while it takes investment to have such a monorepo, it seems the benefits are worth the investment.
Google, Microsoft, Facebook and Twitter prefer monorepos but this is not indicative of most large orgs.

You'll notice that those listed have had to customize or create new VCSs to meet their needs.

https://news.ycombinator.com/item?id=17605371
https://news.ycombinator.com/item?id=11789182
https://medium.com/@maoberlehner/monorepos-in-the-wild-33c6e...

This is the appeal to accomplishment fallacy: because large companies have a lot of money, whatever they do must be great. But this is false. They do what they do because they are large companies, not because it is a good idea.

At scale, managing complexity can require either a lot of coordination, or a lot of careful planning. Large companies (especially tech companies) don't do either well, so they pick architectures that remove choices, and iterate on them until they are workable. And they have the money and workforce to do it.

This is the problem external libraries were created to solve, in a time when it was a much harder problem.

Microservices are the same kind of solution, with the same gains and costs for this specific problem.