by 01100011 2446 days ago
Monorepo shortcomings 1 and 2 seem like bullshit to me. Perforce, the popular monorepo at most companies I've worked at, supports access control. Monorepos do not prevent you from segmenting your code into modules and pushing binary/source packages into source control so that builds can avoid compiling everything (TiVo used to do this, and it worked well once you got the hang of it).

I feel like these debates are often fueled by false arguments. Either way you go, you're going to want to build support tools and processes to tailor your VCS to your local needs.


VCS access controls are the wrong tool for solving the "people use code they shouldn't" complaint.

First, VCS ACLs will massively reduce the benefits you're supposed to get from a monorepo. How will you do global refactors in that kind of a situation? How does a maintainer of a library figure out how the clients are actually using it? (The clients must have visibility into the library, but the opposite is unlikely to be true.)

Second, let's say that I maintain a library with a supported public interface that's implemented in terms of an internal interface that nobody's supposed to use. How will VCS ACLs allow me to hide the implementation but not the interface? When a client kicks off a build, the compiler needs to be able to read the implementation parts to actually build the library. It can't be that the clients have access to read the headers but then link against a pre-built binary blob. At that point you don't have a monorepo, you've got multirepos stored in a monorepo.

The actual solution is build system ACLs. Not ACLs for people, but ACLs for projects. Anyone can read the code, but you can say "only source files in directory X can include this header" or "only build files in directory Y can link against this object file".
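
To make the "ACLs for projects" idea concrete, here is a minimal sketch of what such a check could look like. It is not how any particular build system implements it; the `Target` type, the `PUBLIC` sentinel and the directory names are all made up for illustration.

```python
from dataclasses import dataclass

PUBLIC = "*"  # hypothetical sentinel meaning "any directory may depend on this"


@dataclass(frozen=True)
class Target:
    """A buildable unit; `visibility` lists directory prefixes allowed to use it."""
    name: str
    directory: str
    visibility: tuple = ()


def may_depend(consumer_dir: str, target: Target) -> bool:
    """Return True if code living in consumer_dir may depend on target."""
    # Code in the target's own directory can always use it.
    if consumer_dir == target.directory:
        return True
    # Otherwise the consumer must be public-allowed or sit at/under a listed prefix.
    return any(
        prefix == PUBLIC
        or consumer_dir == prefix
        or consumer_dir.startswith(prefix + "/")
        for prefix in target.visibility
    )


# Hypothetical layout: everyone can read both, but only lib/ code may use the internals.
public_api = Target("api", "lib/api", visibility=(PUBLIC,))
internal = Target("impl", "lib/internal", visibility=("lib/api",))

for consumer in ("apps/web", "lib/api"):
    print(consumer, "->", may_depend(consumer, public_api), may_depend(consumer, internal))
```

The point of the design is that the rule is attached to the target, not to a person: the check runs at build time, so reading the code stays unrestricted.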

VCS ACLs can allow for read-only access. You can also split public interfaces into their own header. If you want the maintainer of a library to be able to refactor clients of the library, then you have to grant them access to the client code. How does a multirepo solve this issue?

> How will VCS ACLs allow me to hide the implementation but not the interface?

If you don't give people access to the code, they can't build it. So what? Publish pre-built binaries from your CI system back to source control.

> At that point you don't have a monorepo, you've got multirepos stored in a monorepo.

I think it's a spectrum. It would be stupid to dogmatically stick to either extreme. You modify things in a pragmatic fashion to solve the problems you're facing. In my experience, starting with a monorepo and making exceptions as needed has worked better than the alternative.

Your post sounds similar to a lot of the multi/mono repo discussions. You've focused on one problem and one way to solve that problem without considering that there are many ways to work around it. Neither approach is going to be pain-free and both require tooling for special scenarios.

Bazel has this via the 'visibility' attribute on packages and build rules: https://docs.bazel.build/versions/master/skylark/build-style...
> VCS access controls are the wrong tool for solving the "people use code they shouldn't" complaint.

I agree

> The actual solution is build system ACLs.

Or, maybe, better languages enforcing better design. In most cases artifacts and libraries are not related to the domain; engineers create them just to establish artificial boundaries between code components, isolate unrelated things, enforce encapsulation and avoid accidental mixing of metalanguages.

It would be a lot better to have a smart compiler for this.

A tool which can prevent us from mixing different abstraction layers, creating unnecessary horizontal links between our components, etc.

I have a couple of ideas about what such a thing might look like.

> Monorepo shortcomings 1 and 2 seem like bullshit to me.

It's a blog post, and the author didn't try to build a total and exhaustive formal system. These shortcomings are not absolute truths, but in practice they are real.

I've seen this multiple times: a small project evolves over years into a monster. Engineers add new components and reuse any other components they may need, creating horizontal links. At some point they feel like they have lost their productivity, and they blame the monorepo because it's easy to create horizontal links in a typical monorepo. So they try to build a multirepo flow, and they spend a lot of effort, time and money trying to make it work. At some point they feel that their productivity is even worse than it was before, because now they need to orchestrate things, so they merge everything back.

Same applies not only to VCS flows, but to system design as well.

When we discuss the monolith/microservices controversy, all the monorepo/multirepo arguments may be isomorphically translated to that domain. What is better, a monolithic app or a bunch of microservices? A role-based app of course: https://github.com/7mind/slides/blob/master/02-roles/target/...

Monorepo/multirepo and monolith/microservice are orthogonal concepts. When organizations don't understand that, they may end up building a distributed monolith across multiple repos. (The "Distributed Big Ball of Mud" anti-pattern.)

Monorepo advocates are typically advocating for microservices, but within a single code base.

The way you provide access control is through code review and build system visibility.

In order to modify another group's code you require their approval on the review for that section of the code base. (Using mechanisms like github/gitlab owners files or rules within upsource.)
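
A rough sketch of that review rule, under simplifying assumptions: owners are keyed by directory prefix and the longest matching prefix wins. Real owners files (GitHub, GitLab, Upsource) each have their own pattern syntax and precedence rules, so treat the `OWNERS` dict and both function names as hypothetical.

```python
def owners_for(path, owners):
    """Return the owners of the most specific (longest) directory prefix covering path."""
    matches = [p for p in owners if path == p or path.startswith(p + "/")]
    if not matches:
        return set()  # unowned code: nobody's approval is required in this sketch
    return set(owners[max(matches, key=len)])


def missing_approvals(changed_files, approvers, owners):
    """Return the changed files that no owner of their section has approved."""
    approvers = set(approvers)
    missing = []
    for path in changed_files:
        required = owners_for(path, owners)
        if required and not (required & approvers):
            missing.append(path)
    return missing


# Hypothetical ownership map for a monorepo with three sections.
OWNERS = {
    "payments": ["alice"],
    "payments/fraud": ["bob"],
    "search": ["carol"],
}

change = ["payments/api.py", "search/index.py"]
print(missing_approvals(change, {"alice"}, OWNERS))  # files still waiting on an owner
```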

This still means that if one group needs to make extensive changes to another group's code, the path of least resistance may be to fork it into your own group's section of the repo.

Build tools provide another point of control. If you're using a tool like Bazel, the way you link to a component in another portion of the repo is through target names. The only targets your code will have access to are those that the owners have declared as being available for external builds.

> Monorepo/multirepo and monolith/microservice are orthogonal concepts.

Yes and no. In both cases it's a story about components and their isolation.

> they may end up building a distributed monolith

Yup, seen that many times.

> Monorepo advocates are typically advocating for microservices, but within a single code base.

I'm advocating roles. Everywhere.

> If you're using a tool like bazel

If only Bazel supported Scala well enough...

> If only Bazel supported Scala well enough...

Many companies build their Scala code using Bazel[1]. For example, Databricks wrote about their experience using Bazel on a monorepo containing mostly Scala[2]. Can you share the specific concerns or issues you faced? Thanks.

[1] https://github.com/bazelbuild/rules_scala/blob/master/README... [2] https://databricks.com/blog/2019/02/27/speedy-scala-builds-w...

(Disclaimer: I work on Bazel)

Thank you, I know. Though I need to build ScalaJS (and I have one small Scala Native project). This is a total no-go for Bazel. Unfortunately.
All of the supposed flaws of a monorepo in this article are actually flaws of git. This is a very common phenomenon. I often joke there are two kinds of developers: those who prefer monorepos and those who have never used perforce.
This is all true BUT I think the monorepo as described here is the act of treating all your projects as directly referencing each other.

Sure you could just use a manyrepo style of dependency tracking in a monorepo but I think that's not exactly what the author is exploring.

> This is all true BUT I think the monorepo as described here is the act of treating all your projects as directly referencing each other.

From what I read, that is a correct assessment. What the OP is proposing is something of a strawman argument. No advocate of monorepos I've ever met believes that a monorepo should imply a monolith.

Generally they're advocating monorepos in order to develop microservices faster and with less effort. Using a monorepo and the associated tooling sidesteps the pain that comes from complicated CI, the difficulty of sharing code, the difficulties of non-atomic cross-repo reviews, and the difficulties of making multi-app refactorings.

Can you elaborate on “monorepos do not prevent you from checking packages into source control” and how that helps to avoid recompiling everything? Why would you check a package into source control anyway? Surely source control is for source code? And I lean toward monorepos, btw, but there are still lots of obstacles, and monorepo proponents don’t tend to acknowledge them or offer clear suggestions for how to solve or work around them.
You can use something like a shared binary repo such as maven or you could just check in dependencies and not worry about an external server being available for builds.

>Surely source control is for source code?

This is just pedantry. Checking in binaries is a pragmatic solution that solves a lot of problems.

I was rather under the impression that checking in binaries was discouraged because it led to performance issues and tends to blow up the repository size. I don't think it's just pedantry.
I wasn’t trying to be a pedant, I’ve just never heard of anyone doing this. I was wondering how it helped solve the problem of not rebuilding everything.
In short, the binaries are already built. Usually it's faster to link to a prebuilt binary than to build from scratch.
So where do these binaries get built and how does the system know which binaries to rebuild for a given change? If developers are building binaries and committing them directly, doesn’t that open up security or even correctness issues? How does this approach satisfy compliance concerns (how can the CTO or a manager sign off on the changes that went into the binary if it’s just something a random developer committed?)? How does this scale to tens of deployments per day? These are hard monorepo problems, and they keep being handwaved away.
Suppose the binaries in question are build tools or similar: then this is good, because they never get rebuilt. The paperwork is done, the binaries get committed to version control, and everybody that builds the code then builds the code with the approved binaries. Everybody is happy.

Suppose the binaries are build byproducts, and people just check this stuff in, like, whatever. Well, if somebody needs to sign off on the output, that's a problem - so that person then doesn't use what's in the repo, but instead builds the output from scratch, from the source code, hopefully with known build tools (see above!), and signs off on whatever comes out.

But, day to day, for your average build, which is going to be run on your own PC and nowhere else, nobody need sign off on anything. If you link with some random object file that was built on a colleague's machine, say, then that's probably absolutely fine - and even if it isn't, it's still probably fine enough to be getting on with for now. If you work for the sort of company that's worried about this stuff, there's a QA department, so any issues arising are not going to get very far.

Overall, this stuff sorts itself out over time. Things that are problems end up having procedures introduced to ensure that they stop happening. And things that are non-problems just... continue to happen.

>So where do these binaries get built and how does the system know which binaries to rebuild for a given change?

For simple things, if the code in a directory changes then the CI system does a rebuild of that directory. You can have the CI system either validate that the binary matches or commit the binary itself. For more complicated things, you'll have a build system such as Bazel, which figures out what changed.
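
For the simple case, something like the following sketch would do. It assumes one package per top-level directory and, for the validation step, a reproducible build (otherwise the bytes will never match). The function names and paths are invented for the example.

```python
import hashlib


def dirs_to_rebuild(changed_files, package_dirs):
    """Return the package directories that contain at least one changed file."""
    hit = set()
    for path in changed_files:
        for pkg in package_dirs:
            if path.startswith(pkg + "/"):
                hit.add(pkg)
    return sorted(hit)


def binary_matches(committed: bytes, rebuilt: bytes) -> bool:
    """Validate a checked-in binary against a fresh CI build by comparing digests.

    Only meaningful if the build is reproducible (an assumption of this sketch).
    """
    return hashlib.sha256(committed).digest() == hashlib.sha256(rebuilt).digest()


changed = ["libfoo/src/a.c", "docs/readme.md"]
print(dirs_to_rebuild(changed, ["libfoo", "libbar"]))
```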

It's really not any different than depending on the exact version in some dependency manager. Instead of just the dependency config you check in the binary. When a dev needs a newer version of a dependency they can pull it down and check it in. You wouldn't check in random nameless binaries, just hard copies of things you would have linked to from a dependency repository.

This doesn't work well for dependencies where you're expected to be using the latest version of something that changes 10 times a day.

The rest of your questions are fairly irrelevant, as they would be answered the same way as in the dependency repo case, i.e., use official binaries.

...but this is closer to multi-repo than monorepo. If you're in a monorepo you might as well use the source.

> So where do these binaries get built and how does the system know which binaries to rebuild for a given change?

By the CI. All major CI/CD tools support rules like build binary x whenever a file under x-src/* changes; commit binary x when the ref matches /v[0-9.]+/; don't allow developers to manually push to these refs / paths; (run a script to) bump the dependent x of y whenever binary x changes; merge the bumped version if all tests still pass; etc.
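
As a toy model of the first two rules (not the config syntax of any real CI tool), path and ref matching could look like this. The action strings and the `actions_for` name are placeholders.

```python
import re
from fnmatch import fnmatch

# The ref pattern from the rule above; fullmatch so "v1.2-rc1" does not count.
RELEASE_REF = re.compile(r"v[0-9.]+")


def actions_for(changed_files, ref):
    """Decide what CI should do for a push, given changed paths and the ref name."""
    actions = []
    # Rule 1: build binary x whenever a file under x-src/ changes.
    # Note: fnmatch's "*" also matches "/", so nested files are covered.
    if any(fnmatch(path, "x-src/*") for path in changed_files):
        actions.append("build x")
        # Rule 2: commit the built binary only on release-style refs.
        if RELEASE_REF.fullmatch(ref):
            actions.append("commit x")
    return actions


print(actions_for(["x-src/core/main.c"], "v1.2.0"))
print(actions_for(["x-src/core/main.c"], "feature/foo"))
```

The remaining rules (protected refs, bumping dependents, merging on green) are policy on top of the same two primitives: match paths, match refs.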

Not sure how people do this in practice. But in principle it seems rather straightforward.

A compiler is just a program that takes some input and creates some output. Both the compiler and the input can have a cryptographically secure hash. Putting both in a sealed box, like a Docker image, with its own hash, gives you a program that takes no input and produces some output.

If the box changes, run it on a trusted machine and save the output together with a signed declaration of which box version produced it.
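
Here is a small sketch of the hashing half of that idea, leaving out the signing and the actual sandbox. `box_hash` and `OutputStore` are names invented for the example, and the "compiler" is just a function passed in.

```python
import hashlib


def box_hash(compiler: bytes, source: bytes) -> str:
    """Hash the compiler and its input together into one 'sealed box' identity."""
    h = hashlib.sha256()
    # Length-prefix each part so (b"ab", b"c") and (b"a", b"bc") hash differently.
    for part in (compiler, source):
        h.update(len(part).to_bytes(8, "big"))
        h.update(part)
    return h.hexdigest()


class OutputStore:
    """Keeps one output per box hash; rebuilds only when the box changes."""

    def __init__(self, build):
        self._build = build      # stand-in for "run the box on a trusted machine"
        self._outputs = {}
        self.builds_run = 0

    def get(self, compiler: bytes, source: bytes):
        key = box_hash(compiler, source)
        if key not in self._outputs:
            self._outputs[key] = self._build(compiler, source)
            self.builds_run += 1
        # Returning the key alongside the output records which box produced it.
        return key, self._outputs[key]


# Fake build step so the sketch runs without a real compiler.
store = OutputStore(lambda compiler, source: compiler + b":" + source)
store.get(b"cc-1.0", b"int main(){}")
store.get(b"cc-1.0", b"int main(){}")   # same box, served from the store
store.get(b"cc-1.1", b"int main(){}")   # new compiler, new box, rebuilt
print(store.builds_run)
```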

At Google we check in the source of every library into the monorepo and compile them ourselves with cached builds from a central server. I don't think we use package managers.
You don't have to use a package manager, that's just the approach the TiVo folks came up with a couple decades ago. They use RPM to package independent software modules and check them into (IIRC) a separate build repository which saves the last n months of work. A local config file is used to choose the binary package version to use, or, alternatively, the locally built files to use. They probably could have just made tarballs, since I don't think they used any of the dependency checking.
How do you track dependencies of dependencies? Do you need to manually add the full dependency tree and reimplement the dependency tracking through your internal system? If a project uses Maven or Gradle, do you need to rewrite those files to point to your internal builds instead?
Not a Googler, but I think the answer is: yes. At least, it is for my monorepo company.

Usually somebody else has already gone through the work of doing it for you. Sometimes there are tools that do the translation for you. For example, Go modules are quite easy to translate to a BUILD file.

It’s actually not as bad as it sounds. You only have to do the hard stuff once, and every engineer in the org who uses it in the future is thankful for it.

They use a tool called Blaze (Google around for “Bazel”, which is the open source tool inspired by it). Basically you model the dependency tree such that the tool knows which targets are affected by a certain change, and then Blaze builds them in a clean room environment such that an undeclared dependency would cause the build to fail (hermetic builds). As far as I’m aware, this is the only way to sustainably operate a monorepo, but I would be happy to learn more if someone has other solutions.
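
The "which targets are affected" part is a reverse walk over the declared dependency graph. A minimal sketch of that idea follows; it is an illustration, not Blaze's or Bazel's actual algorithm, and the target names are made up.

```python
from collections import deque


def affected_targets(deps, changed):
    """Return every target that must be rebuilt when the `changed` targets change.

    `deps` maps each target to the list of targets it directly depends on.
    """
    # Invert the graph: for each target, who depends on it?
    reverse = {}
    for target, needs in deps.items():
        for dep in needs:
            reverse.setdefault(dep, set()).add(target)

    # Breadth-first walk from the changed targets up through their users.
    seen = set(changed)
    queue = deque(changed)
    while queue:
        current = queue.popleft()
        for user in reverse.get(current, ()):
            if user not in seen:
                seen.add(user)
                queue.append(user)
    return seen


# Hypothetical graph: app -> lib -> base, tool -> base, docs depends on nothing.
DEPS = {"app": ["lib"], "lib": ["base"], "tool": ["base"], "docs": []}
print(sorted(affected_targets(DEPS, ["lib"])))
```

This only works if the declared graph is complete, which is why the hermetic part matters: an undeclared dependency would be invisible to the walk.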
I assume you mean third party dependencies that are not in the monorepo? Pretty much yes, monorepos struggle if they are expected to handle dependencies that aren't stored in the monorepo, so step 1 of using a dependency from outside of a monorepo should be to copy the source into the monorepo (and transitively copy the source of dependencies, etc).
Full dependency tree, yep. No build in Google's main repo ever retrieves code externally.
It's version control, not necessarily just source control! If something could benefit from being versioned, why would you not check it in? You then guarantee everybody has the same version. That's exactly what this thing is there for.

Git's design can limit its usefulness in this respect - though perhaps you could solve this to some extent with git LFS? - but not all version control systems have this problem.

git annex (or git LFS, if you buy into GitHub's NIH) is requisite if you want to use git like this, broadly. git will happily store any and all binaries you ask it to, but upon (blind) checkout, it will grab every single revision of said binary, taking up however much space that takes.

(partial clones avoid this, but, as git isn't designed for this use case, grabbing all of history happens far too easily.)