by gregdoesit 2390 days ago
> If you insist on no errors on master ever you will kill throughput.

Unless you solve this engineering problem with tooling. At Uber, the full-blown CI mobile test suite takes over 30 minutes to run on a development machine (linting, unit tests, UI tests, with most of this time going to the long-running UI tests, which are specific to native mobile). So we only do incremental runs locally, and have a submit queue, which parallelises this work and merges into master only the changes that don’t break it. And we have one repository that hundreds of engineers work on.
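To make the submit-queue idea concrete, here is a minimal Python sketch. It is not Uber's actual system: `Change`, `run_tests`, and `submit_queue` are hypothetical names, the "test suite" is a stand-in predicate, and the loop is serial, whereas the real queue parallelises the work.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Change:
    name: str
    breaks_build: bool = False  # stand-in for "the real test suite would fail"


def run_tests(candidate: List[Change]) -> bool:
    """Hypothetical test runner: green if no change in the candidate breaks the build."""
    return not any(c.breaks_build for c in candidate)


def submit_queue(
    master: List[Change],
    queue: List[Change],
    tests: Callable[[List[Change]], bool],
) -> Tuple[List[Change], List[Change]]:
    """Test each queued change on top of the current master; merge it only if green."""
    master = list(master)
    rejected: List[Change] = []
    for change in queue:
        candidate = master + [change]
        if tests(candidate):
            master = candidate       # master only ever advances to a tested state
        else:
            rejected.append(change)  # left open for the author to fix
    return master, rejected
```

With a queue of `a`, a breaking `b`, and `c`, this lands `a` and `c` and hands `b` back to its author, so master never holds an untested combination.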

It’s not an easy problem and the solution is also rather complex, but it keeps master at green - with the trade-off of having needed to build and maintain this system. See it discussed on HN a while ago: https://news.ycombinator.com/item?id=19692820


How do you handle situations like this: multiple developers add merge requests to the queue, and the changes they made are mutually exclusive (an automatic rebase won't work)? What happens when the first branch gets merged to master and the next 10 are still in the queue? How do you mitigate that to shorten the development cycle?

Let's just say that at my company it also takes 30 minutes to run tests, and 4 hours to run them in the merge pipeline with FATs and CORE tests. It's way too long and badly cripples productivity.

A lot of the below comments touched on things we do (verifying that changesets are independent, breaking tests into smaller pieces, prioritising changes that are likely to succeed). They add up and the approach does become more complex. We wrote an ACM white paper with more of the details[1]. It’s the many edge cases and several optimisation problems that turn this into an interesting theoretical and practical problem.

[1] http://delivery.acm.org/10.1145/3310000/3303970/a29-ananthan...

Sorry, but that link points to a "not found" page.
I hope it is possible to decompose this into two problems:

1. Dependencies in incompatible Merge Requests that need to be accounted for, see https://docs.gitlab.com/ee/user/project/merge_requests/merge... on how to do that.

2. Most merge requests can merge in previous changes; for that you can use merge trains, as detailed in my other comment https://news.ycombinator.com/item?id=21679515

Well, the first step is to optimize, parallelize and refactor so that you do not have a single process that takes hours, but many separate ones you can run at once in a cluster.

If those get too expensive to run or you cannot speed them up, then you have to do what Chromium does: run them post-commit, then bisect and revert any changes that break the tests. If things are truly broken you close the tree for a bit while you get the break reverted or fixed.
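The bisect step can be sketched roughly like this in Python. This is not Chromium's tooling: `first_bad_commit` and `passes` are hypothetical names, and the sketch assumes a single culprit, deterministic tests, and that the tree stays broken once the bad commit is included.

```python
from typing import Callable, Sequence


def first_bad_commit(
    commits: Sequence[str],
    passes: Callable[[Sequence[str]], bool],
) -> str:
    """Binary-search the commits landed since the last green run.

    Assumes the empty prefix passes and the full list fails.
    """
    lo, hi = 0, len(commits)  # invariant: prefix of length lo passes, length hi fails
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if passes(commits[:mid]):
            lo = mid
        else:
            hi = mid
    return commits[hi - 1]  # the commit whose inclusion first turns the tree red
```

For n commits this needs about log2(n) test runs instead of n, which is what makes testing post-commit affordable when each run is slow.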

Also, the system that lands changes tests them optimistically in parallel, assuming they will all succeed, so it can land a change in only 30 minutes, for example.
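A rough Python sketch of that optimistic scheme, under my own assumptions rather than any real system's design: for a queue A, B, C it tests A, A+B, and A+B+C at the same time, lands the longest passing prefix, rejects the first failure, and re-speculates for whatever is left. `land_optimistically` and `passes` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, Tuple


def land_optimistically(
    master: Sequence[str],
    queue: Sequence[str],
    passes: Callable[[List[str]], bool],
) -> Tuple[List[str], List[str]]:
    """Test every queued change in parallel, assuming all changes ahead of it succeed."""
    master = list(master)
    pending = list(queue)
    rejected: List[str] = []
    with ThreadPoolExecutor() as pool:
        while pending:
            # Speculative candidates: master+[c0], master+[c0, c1], ...
            candidates = [master + pending[: i + 1] for i in range(len(pending))]
            results = list(pool.map(passes, candidates))
            # Land the longest run of consecutive passes from the front of the queue.
            landed = 0
            while landed < len(results) and results[landed]:
                landed += 1
            master += pending[:landed]
            pending = pending[landed:]
            if pending:
                # The next change failed on top of a green base, so it is the culprit.
                rejected.append(pending.pop(0))
    return master, rejected
```

When everything passes, the whole queue lands after a single round of test time; each failure costs one extra round for the changes queued behind it.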

What you describe is typically an architecture problem: if you have a good architecture in place the problem won't happen because you have already broken your system up so that those places that 10 completely different developers need to touch do not exist in the first place. You need to hire more senior developers to think about this problem and fix it. You should be able to assign every area of code to a small team of developers who work together and coordinate their changes to that area. (even with common code ownership you quickly specialize just because on a large project you cannot understand everything)

There are exceptions. Sometimes there is a management problem: management has been told some things cannot be done in parallel because you couldn't mitigate the problem in architecture and they failed to apply project management practices to ensure the developers worked serially.

Sometimes there is a team problem: the 10 developers have been placed on the same team to work on the same thing, and despite all that they still failed to coordinate among themselves to ensure that the changes happened in order.

The robot won’t merge a change in the queue if it can’t be merged or tests fail. The changeset would be left open and the developer notified to fix it.

The whole process assumes that multiple changes in the queue don’t depend on each other; if they did, they should all be in the same changeset.

It assumes most do not, but it’s entirely possible for someone to change a common library, which makes several downstream changes wait. Even if there are no merge conflicts, if they affect the same tests, changes will have to wait.
don't work at uber but have similar problems at my job and i'm quite convinced the problems you ask about are part of the 'not easy' in the OP comment. maybe they can queue whole branches instead of single checkins?
This may surprise you, but tests only help you find breaking changes. They don't guarantee that there are none. To guarantee anything costs a lot.
Tests are there to prove everything you thought of works correctly. They do nothing to find things you didn't think of. However with effort ($$$) you can become very creative in thinking about failure cases that you then test.

If you want to find problems you didn't think of, formal proofs are the only thing I have heard of. However, formal proofs only work if you can think of the right constraints (I forget the correct term), which isn't easy.

Note that the two are not substitutes for each other. While there is overlap, there are classes of errors that one alone will not catch. For most projects, though, it is more cost-effective to live with some bugs for a long time than to spend enough money on either of the above to find them ahead of time. Different projects have different needs (games vs medical devices...)

An alternative system includes optimizations: 1) don't use a monorepo, 2) don't run tests that have nothing to do with the code changed. Both require redesign of code structure, testing, execution, but both remove the inherent limits of integration.

Nobody seems to talk about this and I don't know why. It would remove integration complexity and speed up testing. We do the same thing for CD and nobody seems to have a problem with it...

The queue and test systems Uber and Google use for their monorepos essentially do both of those. The restructuring you mention was to use a build system such as Bazel or Buck universally.

1) Two changes which don’t affect intersecting parts of the repo are landed separately. Similar to having infinite separate repos.

2) Only the tests that your code affects are run.

This is all possible because Bazel lets you look at a commit and determine with certainty which tests need to run and which targets are affected.
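The two points above boil down to walking the build graph backwards from whatever changed. Here is a small Python sketch of that idea. It is not Bazel's API (in Bazel you would reach for a reverse-dependency query such as `rdeps`); the `deps` mapping, the target labels, and both function names are made up for illustration.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, Set


def affected_targets(deps: Dict[str, Set[str]], changed: Iterable[str]) -> Set[str]:
    """Return the changed targets plus everything that transitively depends on them.

    `deps` maps each target to the set of targets it directly depends on.
    """
    rdeps: Dict[str, Set[str]] = defaultdict(set)  # reverse edges: dep -> dependents
    for target, direct in deps.items():
        for dep in direct:
            rdeps[dep].add(target)
    seen = set(changed)
    work = deque(seen)
    while work:
        current = work.popleft()
        for dependent in rdeps[current]:
            if dependent not in seen:
                seen.add(dependent)
                work.append(dependent)
    return seen


def independent(deps: Dict[str, Set[str]], a: Iterable[str], b: Iterable[str]) -> bool:
    """Two changes can land separately if their affected sets do not intersect."""
    return affected_targets(deps, a).isdisjoint(affected_targets(deps, b))
```

The tests to run for a change are just the test targets inside its affected set, and two changes whose affected sets are disjoint never have to wait on each other in the queue.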

That is good to hear, but I'm interested in finding the patterns that make this feasible without a build tool designed for massively parallel building, testing, and integration of a single codebase. A lot of the historic reasons for Google's build system come down to "we just like the monorepo but it needs complex tooling to work".
For a complex project you need complex tooling. A mono-repo and a multi-repo system have different needs, but both need complex tooling to work. Neither is inherently better than the other; there are pros and cons. Sometimes those are compelling (which is why a few projects at Google are not part of the mono-repo).

Personally, I prefer the pros and cons of multi-repo. However, sometimes I wish I could do the large cross-project refactoring that a mono-repo would make easy.