by hinkley 2393 days ago
> The scenario we want to avoid is that a faulty commit makes it to the main branch.

Close. The scenario we want to minimize is faulty code on the main branch. As your team grows, as the number of commits goes up, it becomes a game of chance. Sooner or later something will get through. The more new teammates you have, the more often that will happen.

This is an inescapable cost of growth. The cost of promoting people to management. The cost of starting new projects. Occasionally you can avoid it as a cost of turnover, but you will have turnover at some point.

What matters most is how long the code is "broken" (including false positives) before it is identified, mitigated, and fully corrected. The amount of work you can do to keep these numbers relatively stable in the face of change is profound.

If you insist on no errors on master ever you will kill throughput. You will create situations where the only failures are big, which is neck deep in the philosophy that CI rejects: that problems are to be avoided instead of embraced and conquered.


> If you insist on no errors on master ever you will kill throughput.

There are a large number of automated tools which will help you prevent merging code that could break master: https://github.com/chdsbd/kodiak#prior-art--alternatives. The basic approach is to make a new branch from master, apply one or more commits on top of that branch, run the tests, and if the tests pass, merge those commits (with fast-forward) back onto master. This makes it very difficult to get broken commits on master, as they have to pass the tests first. It is still possible if you have a flaky test suite, but in my experience it happens very rarely, and is usually very easy to fix if something creeps in. In my experience, these tools speed up throughput, not slow it down, especially when you account for the disruption that merging broken code to master can cause.

https://graydon2.dreamwidth.org/1597.html has a good discussion of this:

The Not Rocket Science Rule Of Software Engineering:

automatically maintain a repository of code that always passes all the tests
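
To make the "branch from master, apply commits, run tests, fast-forward" approach concrete, here is a minimal sketch. It only simulates the flow in memory: the commit names, `try_land`, and the `no_broken_commits` stand-in for a test run are all hypothetical, and none of this is how Kodiak or any other tool is actually implemented.

```python
from typing import Callable, List

Commit = str
TestSuite = Callable[[List[Commit]], bool]


def try_land(master: List[Commit], candidate: List[Commit], tests: TestSuite) -> List[Commit]:
    """Return the new master: fast-forwarded if the tests pass, unchanged otherwise."""
    # Build a throwaway branch: current master plus the candidate commits on top.
    staging = master + candidate
    # Only a staging branch that passed the tests may become master.
    return staging if tests(staging) else master


def no_broken_commits(history: List[Commit]) -> bool:
    # Hypothetical stand-in for a real test run.
    return not any(c.startswith("broken") for c in history)


master = ["a", "b"]
master = try_land(master, ["c"], no_broken_commits)         # lands
master = try_land(master, ["broken-d"], no_broken_commits)  # rejected, master untouched
master = try_land(master, ["e"], no_broken_commits)         # lands on top of "c"
print(master)
```

The point of the sketch is that master only ever moves to a state that has already been tested as a whole, which is why a broken commit has to get past the tests (for example through a flaky suite) to land.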

The problem with a popular repository can be that running the tests can take longer than the time you have between merges.

In GitLab we made merge trains https://docs.gitlab.com/ee/ci/merge_request_pipelines/pipeli... to solve this problem automatically.

With merge trains, each merge request with a passing feature branch is placed in a queue, and tests are run against the combination of that branch and all the branches ahead of it merged in. Since tests will pass 95%+ of the time when the feature branch passes, this can increase the number of merges you can get into master by 10x or more.
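
A rough sketch of the queueing idea, under simplifying assumptions: branches are just names, `passes` is a hypothetical stand-in for a pipeline run, and the "parallel" pipelines are evaluated in a loop. This is not GitLab's implementation, only an illustration of why one round can land many branches.

```python
from typing import Callable, List, Tuple


def run_train(master: List[str], queue: List[str],
              passes: Callable[[List[str]], bool]) -> Tuple[List[str], List[str], int]:
    """Return (new master, rejected branches, number of pipeline rounds)."""
    rejected: List[str] = []
    pending = list(queue)
    rounds = 0
    while pending:
        rounds += 1
        # Pipeline i tests master plus the first i+1 pending branches.
        # A real system would run these pipelines in parallel.
        results = [passes(master + pending[: i + 1]) for i in range(len(pending))]
        if all(results):
            master = master + pending
            pending = []
        else:
            first_bad = results.index(False)
            # Branches ahead of the failure were tested without it and can land.
            master = master + pending[:first_bad]
            rejected.append(pending[first_bad])
            # Branches behind the failure were tested with it, so they are retested.
            pending = pending[first_bad + 1:]
    return master, rejected, rounds


def ok(state: List[str]) -> bool:
    # Hypothetical test run: fails whenever the "bad" branch is included.
    return "bad" not in state


new_master, rejected, rounds = run_train(["m"], ["a", "b", "bad", "c", "d"], ok)
print(new_master, rejected, rounds)
```

In this toy run, four branches land in two pipeline rounds, where testing them strictly one after another would take five. The fewer failures in the queue, the closer you get to landing the whole queue in a single round.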

automated tests don't necessarily catch all errors
> If you insist on no errors on master ever you will kill throughput.

Unless you solve this engineering problem with tooling. At Uber, the full-blown CI mobile test suite takes over 30 minutes to run on a development machine (linting, unit tests, UI tests - most of this time being the long-running UI tests, specific to native mobile). So we only do incremental runs locally, and have a submit queue, which parallelises this work and merges into master only the changes that don’t break it. And we have one repository that hundreds of engineers work on.

It’s not an easy problem and the solution is also rather complex, but it keeps master at green - with the trade-off of having needed to build and maintain this system. See it discussed on HN a while ago: https://news.ycombinator.com/item?id=19692820

How do you handle situations like this: multiple developers add merge requests to the queue, and the changes they made are mutually exclusive (an automatic rebase won't work). What happens when the first branch gets merged to master and the next 10 are still in the queue? How do you mitigate that to shorten the development cycle?

Let's just say in my company it also takes 30m to run tests and 4h to run them on the merge pipeline with FATs and CORE tests. It's way too long and severely cripples productivity.

A lot of the below comments touched on things we do (verifying that changesets are independent, breaking tests into smaller pieces, prioritising changes that are likely to succeed). They add up and the approach does become more complex. We wrote an ACM white paper with more of the details[1]. It’s the many edge cases and several optimisation problems that turn this into an interesting theoretical and practical problem.

[1] http://delivery.acm.org/10.1145/3310000/3303970/a29-ananthan...

Sorry, but that link points to a "not found" page.
I hope it is possible to decompose this in two problems:

1. Dependencies in incompatible Merge Requests that need to be accounted for, see https://docs.gitlab.com/ee/user/project/merge_requests/merge... on how to do that.

2. Most merge requests can merge in previous changes; for that you can use merge trains as detailed in my other comment https://news.ycombinator.com/item?id=21679515

Well, the first step is to optimize, parallelize and refactor so you do not have a single process that takes hours, but many separate ones you can run at once in a cluster.

If those get too expensive to run or you cannot speed them up, then you have to do what Chromium does: run them post commit, then bisect and revert any changes that break the tests. If things are truly broken you close the tree for a bit while you get the break reverted or fixed.

Also, the system that lands changes tests them optimistically in parallel, assuming they will all succeed, so it does not land just one change every 30 minutes, for example.
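
The "run post commit, then bisect and revert" step can be sketched roughly like this. It assumes a single culprit, a tip that is known to fail, and that once a commit breaks the tests every later state stays broken. The commit names and the `is_good` check are hypothetical, and this is not Chromium's actual tooling.

```python
from typing import Callable, List


def first_bad_commit(commits: List[str], is_good: Callable[[List[str]], bool]) -> int:
    """Binary search for the index of the first commit after which the tests fail.

    Assumes the state after the last commit is bad and that breakage persists.
    """
    lo, hi = 0, len(commits) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_good(commits[: mid + 1]):
            lo = mid + 1  # still green after commits[mid]; the culprit is later
        else:
            hi = mid      # already red after commits[mid]; the culprit is here or earlier
    return lo


landed = ["c1", "c2", "c3", "c4", "c5", "c6"]
runs = []


def is_good(state: List[str]) -> bool:
    # Hypothetical test run: "c4" is the commit that breaks the suite.
    runs.append(len(state))
    return "c4" not in state


culprit = landed[first_bad_commit(landed, is_good)]
# "Revert" the culprit by dropping it from the simulated history.
fixed = [c for c in landed if c != culprit]
print(culprit, fixed, len(runs))
```

Here the search needs three test runs to pin down one of six commits, and the cost grows with the logarithm of the number of commits landed since the last known-good state.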

What you describe is typically an architecture problem: if you have a good architecture in place the problem won't happen because you have already broken your system up so that those places that 10 completely different developers need to touch do not exist in the first place. You need to hire more senior developers to think about this problem and fix it. You should be able to assign every area of code to a small team of developers who work together and coordinate their changes to that area. (even with common code ownership you quickly specialize just because on a large project you cannot understand everything)

There are exceptions. Sometimes there is a management problem: management has been told some things cannot be done in parallel because you couldn't mitigate the problem in architecture and they failed to apply project management practices to ensure the developers worked serially.

Sometimes there is a team problem: the 10 developers have been placed on the same team to work on the same thing, and despite all that they still failed to coordinate among themselves to ensure that the changes happened in order.

The robot won’t merge a change in the queue if it can’t be merged or tests fail. The changeset would be left open and the developer notified to fix it.

The whole process assumes that multiple changes in the queue don’t depend on each other, if they did, it should all be in the same changeset.

It assumes most do not, but it’s entirely possible for someone to change a common library which makes several downstream changes wait. Even if there are no merge conflicts, if they affect the same tests, changes will have to wait.
don't work at uber but have similar problems at my job and i'm quite convinced the problems you ask about are part of the 'not easy' in the OP comment. maybe they can queue whole branches instead of single checkins?
This may surprise you, but tests only help you find breaking changes. They don't guarantee it. To guarantee anything costs a lot.
Tests are there to prove everything you thought of works correctly. They do nothing to find things you didn't think of. However with effort ($$$) you can become very creative in thinking about failure cases that you then test.

If you want to find problems you didn't think of, formal proofs are the only thing I have heard of. However, formal proofs only work if you can think of the right constraints (I forget the correct term), which isn't easy.

Note that the two are not substitutes for each other. While there is overlap, there are classes of errors that one alone will not catch. For most projects, though, it is more cost effective to live with some bugs for a long time than to spend enough money on either of the above to find them ahead of time. Different projects have different needs (games vs medical devices...)

An alternative system includes optimizations: 1) don't use a monorepo, 2) don't run tests that have nothing to do with the code changed. Both require redesign of code structure, testing, execution, but both remove the inherent limits of integration.

Nobody seems to talk about this and I don't know why. It would remove integration complexity and speed up testing. We do the same thing for CD and nobody seems to have a problem with it...

The queue and test systems Uber and Google use for their monorepos essentially do both of those. The restructuring you mention was to use a build system such as Bazel or Buck universally.

1) Two changes which don’t affect intersecting parts of the repo are landed separately. Similar to having infinite separate repos.

2) Only the tests that your code affects are run.

This is all possible because Bazel lets you look at a commit and determine with certainty which tests need to run and which targets are affected.
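
As an illustration of both points, here is a small sketch over a made-up dependency graph. The target names only imitate build-label style, and `DEPS`, `affected`, `affected_tests` and `independent` are hypothetical helpers, not Bazel or Buck APIs. A real build system derives this graph from its build files.

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical target graph: each target maps to the targets it depends on.
DEPS: Dict[str, List[str]] = {
    "//lib/common": [],
    "//app/rides": ["//lib/common"],
    "//app/eats": ["//lib/common"],
    "//lib/common:test": ["//lib/common"],
    "//app/rides:test": ["//app/rides"],
    "//app/eats:test": ["//app/eats"],
}


def affected(changed: Set[str], deps: Dict[str, List[str]]) -> Set[str]:
    """Return the changed targets plus everything that transitively depends on them."""
    # Invert the edges so we can walk from a target to its dependents.
    rdeps: Dict[str, List[str]] = {}
    for target, target_deps in deps.items():
        for dep in target_deps:
            rdeps.setdefault(dep, []).append(target)
    seen = set(changed)
    queue = deque(changed)
    while queue:
        current = queue.popleft()
        for dependent in rdeps.get(current, []):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen


def affected_tests(changed: Set[str], deps: Dict[str, List[str]]) -> Set[str]:
    # By convention in this sketch, test targets end in ":test".
    return {t for t in affected(changed, deps) if t.endswith(":test")}


def independent(a: Set[str], b: Set[str], deps: Dict[str, List[str]]) -> bool:
    # Two changes can land separately if their affected sets do not intersect.
    return affected(a, deps).isdisjoint(affected(b, deps))


print(sorted(affected_tests({"//app/rides"}, DEPS)))
print(independent({"//app/rides"}, {"//app/eats"}, DEPS))
print(independent({"//lib/common"}, {"//app/eats"}, DEPS))
```

A change to one app only triggers that app's tests and can land alongside a change to the other app, while a change to the shared library touches everything downstream and has to be serialized with other changes.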

That is good to hear, but I'm interested in finding the patterns that make this feasible without a build tool designed for massively parallel building, testing, and integration of a single codebase. A lot of the historic reasons for Google's build system come down to "we just like the monorepo but it needs complex tooling to work".
For a complex project you need complex tooling. A mono-repo and a multi-repo system have different needs, but both need complex tooling to work. Neither is inherently better than the other; there are pros and cons. Sometimes those are compelling (which is why a few projects at Google are not part of the mono-repo).

For me I prefer the pros and cons of multi-repo. However sometimes I wish I could do the large cross project refactoring that a mono-repo would make easy.

It's possible (and not that hard) to define an integration process that prevents faulty commits from being integrated to the main branch.

> If you insist on no errors on master ever you will kill throughput.

Not sure why you believe this. It hasn't been my experience; just the opposite, in fact. By using CI in conjunction with a process that prevents errors on master, everything goes more smoothly, because people don't get stalled by the broken master.

"It's possible (and not that hard) to define an integration process that prevents faulty commits from being integrated to the main branch. "

You should strive to do that but you shouldn't be surprised that despite all effort mistakes still happen from time to time.

Sure, mistakes happen, which is why the process is typically automated. You have to go out of your way to merge faulty code. It's rare.

Not saying mistakes can't happen, but the person I was replying to didn't seem to be aware of this tooling.

You're both right.

The healthy mentality is to realize mistakes will happen. This creates a healthier culture when things do break.

However, you should take every step to ensure it doesn't happen. You should act as though you want to prevent all faults from hitting your master branch.

It depends on what you mean by “errors on master”. Tests won’t catch all possible bugs in production.
Agree. Some faulty commits may go through. But then you strengthen your test suite to prevent similar issues from happening again, and so on.