by shrewduser 2368 days ago
I'll never understand the fascination with monorepos.
6 comments

Once you reach a certain size of codebase, you're either going to be investing significantly in making many repositories work together and look a bit like a monorepo, or you're going to be investing significantly in making working on individual parts of a monorepo more efficient and look a bit like an isolated repo.

Both approaches take a huge amount of work and tooling.

The big selling point of a monorepo is that the time and effort taken to follow strict versioning and upgrade discipline for multiple interdependent projects can be somewhat avoided, at least on the code side.

If you're looking for a magic bullet argument proving that either approach is strictly better, I'm not the person to ask.
At a certain size, monorepo becomes the worst way to do it except for all the others.

Essentially: version skew across numerous artifacts in a large organization starts to look like the version skew across an industry or ecosystem. The aggregate cost of dealing with it project by project is probably higher (at least, that is what most of the biggest tech companies have concluded) than dealing with it at the source level using a monorepo and single-version policy.

Well, don't have version skew then? Require that anything merged to master doesn't break any tests? Require that tests exist in the first place? Google makes it work at a dramatically larger scale. Everything at tip-of-tree is always ready to go.

EDIT: Looks like I've misread the parent's argument as one against monorepo. It was in fact an argument in favor, and one I agree with.

Yeah but Google does that by being a monorepo.
Looks like I've misread the parent's argument as one against monorepo. My error.
Well for one you can commit to multiple projects in a single PR. Makes coordinating changes across projects much easier.
It gives you that illusion; it doesn't solve versioning and deployment orders, and I'd argue that that's the harder part of changes across projects. Polyrepos make messy things...messy.
Deployment ordering at large scale is avoided and usually done by not making breaking changes. 4 phase migrations, always. Roll out new API, update existing software to use new API, wait for everything to stop using old API + backfill, remove old API.
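To make the four phases concrete, here is a minimal sketch. The `UserService` class, its method names, and the `legacy_calls` counter are hypothetical and only illustrate the idea; a real service would watch call metrics over a time window rather than an in-process counter.

```python
class UserService:
    """Hypothetical service that is midway through a four-phase migration."""

    def __init__(self):
        self._users = {1: ("Ada", "Lovelace")}
        self.legacy_calls = 0  # how often the old API is still being used

    # Phase 1: the new API is rolled out alongside the old one.
    def get_user_v2(self, user_id):
        first, last = self._users[user_id]
        return {"first_name": first, "last_name": last}

    # The old API stays deployable through phases 1-3 and is deleted in phase 4.
    def get_user(self, user_id):
        self.legacy_calls += 1
        user = self.get_user_v2(user_id)
        return f"{user['first_name']} {user['last_name']}"


def safe_to_remove_old_api(service):
    # Phase 3: wait until nothing calls the old API any more.
    return service.legacy_calls == 0


svc = UserService()
print(svc.get_user_v2(1))           # a caller already migrated in phase 2
print(safe_to_remove_old_api(svc))  # True only while no legacy calls were seen
```

Every intermediate state here is deployable on its own, which is the point of the scheme: no step requires two services to change at the same instant.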
I agree that gradual adoption of new APIs is the way to go, but once you're doing that you no longer need an atomic commit across all projects.
You actually never want an atomic commit for that class of changes across projects because HEAD should always be deployable to all services. It's obviously messier at FAANG-scale, but with even 25 devs, not properly staging API-breaking changes leads to a lot of "only deploy commits before xxxx to service foo."
It pretty much does solve the versioning issue. “Latest, always”. The downside is the abysmal state of monorepo build tools. With multirepos, who updates the downstream repos’ dependency files (e.g., requirements.txt) when an upstream project releases a change? And is the policy “latest, always” or do you support N versions of every package? I would argue that the latter is insane at any scale, and the former leaves you dealing with dependencies manually (someone is updating the downstream repos’ dependency files when an upstream change is released) or you build automation that does it and you’re well on your way to implementing your own monorepo-like build tool.

Everything is hard, unfortunately.
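For a sense of what the "build automation that does it" route starts out as, here is a rough sketch of a pin bumper for a downstream requirements.txt. The `bump_pin` function is hypothetical, and it only understands plain `name==version` lines (no extras, environment markers, or version ranges), which is an assumption made to keep it short.

```python
def bump_pin(requirements_text, package, new_version):
    """Rewrite `package==<old>` to `package==<new_version>`; leave other lines alone."""
    updated = []
    for line in requirements_text.splitlines():
        name, sep, _old_version = line.partition("==")
        if sep and name.strip().lower() == package.lower():
            updated.append(f"{name.strip()}=={new_version}")
        else:
            updated.append(line)
    return "\n".join(updated) + "\n"


downstream = "requests==2.31.0\nmylib==1.4.0\n# pinned by hand\n"
print(bump_pin(downstream, "mylib", "1.5.0"))
```

The rewrite itself is the easy part. Opening the PR in each downstream repo, running its tests, and deciding what to do when they fail is where this starts turning into a monorepo-like build tool.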

Oddly, this is also one of the bad sides. Committing to two projects, by necessity, means deploying to two workflows. If not more.

Doing that in one repo makes the commit part easier, but hides the complexity of deploying separately. Or to other places.

Not that two repos makes it easy. Just gives a much earlier signal to where it happens.

Or you can have a single workflow that includes all the projects in the repo. I found it's actually easier to do things like wait for project A to deploy before project B.
Only if the safe deployment order is always the same. In any typical server-client deployment, breaking changes can go in either direction, and which one you can deploy first requires some thought. I've seen 3- or 4-stage deployments for some back-and-forth changes.

In my experience, you're required to break changes up into safe individual deployments anyways, so the monorepo doesn't add any benefit in that sense.

There are tradeoffs both ways. With multirepos you likely have a dependency hell problem and you often have to submit and release several PRs for otherwise small updates. With monorepos, (if you want reasonable build times) you have to be able to determine what has changed and what needs to build (including tests, etc) as a result. This is technically true of multirepos as well, but the problem is pushed into git and manual process.
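The "determine what has changed and what needs to build" step is, at its core, a walk over the reverse dependency graph. A minimal sketch, assuming you already have a map from each target to its direct dependencies (the target names below are made up):

```python
from collections import defaultdict, deque


def affected_targets(deps, changed):
    """Return every target that must be rebuilt and retested when `changed` changes.

    `deps` maps each target to the list of targets it depends on directly.
    """
    # Invert the graph: for each target, record who depends on it.
    dependents = defaultdict(set)
    for target, direct_deps in deps.items():
        for dep in direct_deps:
            dependents[dep].add(target)

    # Breadth-first walk from the changed targets through their dependents.
    affected = set(changed)
    queue = deque(changed)
    while queue:
        current = queue.popleft()
        for dependent in dependents[current]:
            if dependent not in affected:
                affected.add(dependent)
                queue.append(dependent)
    return affected


# Hypothetical dependency graph for a small monorepo.
DEPS = {
    "//lib/auth": [],
    "//lib/storage": [],
    "//svc/api": ["//lib/auth", "//lib/storage"],
    "//svc/worker": ["//lib/storage"],
    "//web/app": ["//svc/api"],
}

print(sorted(affected_targets(DEPS, ["//lib/auth"])))
# ['//lib/auth', '//svc/api', '//web/app']
```

The hard part in practice is not this traversal but getting an accurate `deps` map in the first place, which is what the build tools discussed here are for.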

Having looked seriously at both options, I think the monorepo world is the right one, but it presently lacks good tooling to sanely model your dependency graph AND create custom build rules while still being affordable for small or medium-sized orgs. Git/hub simply isn’t designed for this kind of modeling and everything I’ve seen built atop it is either way too manual or a kludge. Maybe the “kludge” solutions are actually reasonable, but my confidence is low.

Bazel is the right idea, but its execution disappoints. The documentation is abysmal. Last I checked, they advertised Python 3 support, but it’s been broken for years with no signs of progress. Building custom rules also looked hopelessly complex (by which I mean, “not something our organization can afford to implement and maintain”), but maybe there’s some undocumented happy path that I’m missing out on? These things seem easy enough to implement. We’re using Pants right now, and for all its many similar problems (bugs, documentation, poor code base, difficult extensibility), it at least does a passable job at building Python projects.

I’ve thought about it a fair amount, and I think it’s reasonable to build something simpler that might not meet Google’s use case, but would at least enable small and medium sized shops to play the monorepo game.

rules_python has supported py3 for a while.

The next obvious question is, what would you do to make it simpler? Tons of people have tried (you listed 5), and they all rebuilt the same thing. What features do you drop?

Last I checked (maybe 6 months ago), it definitely _didn't_ support py3, although it was advertised. I thought I was doing something wrong, but there were half a dozen issues in the tracker that indicated it was critically broken.

I understand that "it should be simpler" is a pretty lazy criticism. It's been a while since I audited Bazel and friends, and I've forgotten which issues apply to which tool. Moreover, because of the awful state of the documentation and the messiness of the code base (or perhaps this is just standard quality for Java projects?), it's really difficult to tell whether any given issue is actually a fundamental shortcoming in the application or whether it's simply a knowledge gap.

As far as what I want, keep the starlark configuration file format; implement all rules as starlark libraries (such that no one needs to write Java to extend, and if you must write Java then for goodness' sake fix the plugin interface or document it better or something such that one doesn't need to be a core contributor to implement a plugin--perhaps this is fine for an enterprise audience, but it's not fine for my use case). The rules should call into a base `mktarget()` or similar that takes args like the target's ID (the package:target_name pair), a target type that identifies the code used to build the target, and a dict of args/params that are passed into the aforementioned code. The args/params can be an arbitrarily nested JSON-like type so long as the leaves are primitives (int, string, etc), references to source files, or other targets and all leaves (and transitively, the whole structure) must be hashable such that we can identify a given execution of the build.
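Roughly what I mean by that core model, as a sketch only. Nothing here is Bazel's or Pants' API; `Target`, `SourceFile`, and this `mktarget()` signature are made up for illustration, and representing a source file by a content digest is my assumption.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceFile:
    path: str
    content_digest: str  # assumed to be computed elsewhere from the file contents


@dataclass(frozen=True)
class Target:
    target_id: str    # the package:target_name pair
    target_type: str  # selects the code used to build the target
    build_hash: str   # identifies this exact execution of the build


def _canonical(value):
    """Reduce nested args to plain JSON; leaves must be primitives, sources, or targets."""
    if isinstance(value, Target):
        return {"$target": value.build_hash}
    if isinstance(value, SourceFile):
        return {"$source": value.path, "digest": value.content_digest}
    if isinstance(value, dict):
        return {str(k): _canonical(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [_canonical(v) for v in value]
    if value is None or isinstance(value, (str, int, float, bool)):
        return value
    raise TypeError(f"unhashable build argument: {value!r}")


def mktarget(target_id, target_type, args):
    # Sorted keys and fixed separators make the hash independent of dict ordering.
    payload = json.dumps(
        {"id": target_id, "type": target_type, "args": _canonical(args)},
        sort_keys=True,
        separators=(",", ":"),
    )
    return Target(target_id, target_type, hashlib.sha256(payload.encode("utf-8")).hexdigest())


lib = mktarget("//lib:util", "py_library", {"srcs": [SourceFile("lib/util.py", "abc123")]})
app = mktarget("//app:bin", "py_binary", {"main": "app.py", "deps": [lib]})
print(app.build_hash)
```

Because a dependency contributes its own `build_hash`, a change to any source file changes the hash of everything that transitively depends on it, which is what lets you cache or skip unchanged builds.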

Beyond that core operating model, the code and the user interface should be clean and well documented. Ideally, small and medium-sized projects shouldn't need to run it in daemon mode to get reasonable performance. This is important because a daemon running on local development machines introduces a larger maintenance burden (there's just more that can go wrong). Language-specific plugins (custom rules, whatever you want to call them) should adhere pretty closely to the conventions of the target language. Lastly, there should be good support for building toplevel artifacts--this means I should be able to build a whole CloudFormation package including lambdas, Docker images, etc just like I would build a JAR or a C++ binary.

I realize that those things are easy enough to say, but the devil is in the details. I've actually gone so far as to prototype the implementation, so I'm confident that those goals are achievable. Unfortunately, it's a pretty significant effort (mostly due to the breadth of project types/languages to support and the nuance/expertise required to support any of them), so I'm bound by free time. If anyone is interested in collaborating or discussing more in-depth, hit me up on Twitter @weberc2 or email me (username at gmail.com).

As of April of this year, python3 was the default for python rules in bazel.
Good to know. Hopefully it works now.
Cross-building is even worse with many repos. I've been there, done that, and it broke so often. Now we have everything in one repo and we barely have problems. By the way, we are a small shop with fewer than 5 people, but we have a product on metal that requires multiple services (that sometimes interact with each other).

We don't use Bazel (yet), because dotnet is not that well supported.

I am sure you will when you end up working in a huge organization with intricate and heterogeneous project/team interdependencies.

You will soon experience:

- dependency hell due to transitive and conflicting dependencies

- one backward-incompatible change in some obscure library ends up breaking some other unknown service that happens to transitively depend on it

- the entire codebase will become a mess due to inconsistent code styles and formatting, because hey, we are developers and we can never agree on anything. Thus each team lead will have their own opinion

- each team will have to maintain its own CI/CD jobs

- heterogeneous builds: maven, node, sbt, webpack, etc ...

the list goes on ...

All (or most of) this mess is solved by centralizing the codebase in a monorepo.