No need to check out a terabyte of code. If your repo is scaling that high, you're going to want a VFS layer. Microsoft made a VFS layer for Git. As you might imagine, you simply grab files as needed, and your version control just deals with diffs for the most part. Google's own monorepo is proprietary, but the Bazel build system is open source and would work great with a VCS hooked up to a VFS layer.
I want to like Bazel. I really do. But on first encounter the syntax is filled with sigils that don't seem to have obvious differences or purpose for existence. Then it turns out that I and others have spent as much time fighting it as using it. Lastly the coverage of ecosystems is sparse and there does not seem to be a lot of activity around extending them -- doing the boring, tedious, unloved work of dealing with everyone's quirks and bugs and corner cases and annoyances (been there, done that).
Again: I wish it was a smooth experience. Because I like the ideas very much. But it wasn't when I tried and I don't know anyone -- outside of Google -- for whom it was a smooth experience.
I can’t speak to the actual implementation, but I’m surprised at your description of the syntax as “filled with sigils”, as the syntax is basically Python -- isn’t that about as easy as you can get?
I find Bazel’s syntax much easier to deal with than other build languages that use JSON (essentially the same Python syntax but with lots of extra quotes everywhere and extra fussiness about where commas are allowed).
Completely valid concern not to want to keep memorizing mini-languages.
In this case, the double slashes are absolute "paths" relative to the top of the workspace, and the part after the colon is a relative "path" to another Bazel target.
I put "paths" in quotes because these are meaningfully different from the true filesystem equivalents; avoiding confusion with real absolute and relative filesystem paths is probably why they made their own syntactic mini-language.
[The sibling reply to mine, referencing Piper and Perforce, goes into a bit more detail on the specifics and the origin of the // prefix.]
What would the better way have been for them to do this?
> What would the better way have been for them to do this?
I don't know, off the top of my head (having been on the other side of this conversation, I am aware how frustrating that answer is). But I know I couldn't keep it straight when I was fighting Bazel and that I gave up. And anecdotally I am not alone: I have seen Bazel torn out of multiple projects, sometimes quite painfully.
Piper, Google's source control system, has roots in Perforce. In Perforce, depot paths start with //.
The ":" is a bit different: just "//lib" means "//lib:lib", i.e. it points to the "lib" target in the lib/BUILD file, while "//lib:hello-time" points to the "hello-time" target in the same lib/BUILD file. So leaving out the ":name" in "//dir:name" means name="dir", i.e. "//dir:dir". At first this is strange, but then you get used to it: your default target is named after the folder it's sitting in.
It is not a smooth experience outside of Google because the truth is bootstrapping a proper Bazel setup is not actually that easy. If you want hermetic builds for real, you need a hermetic build environment. Bazel tries to accomplish this with a workspace setup in each repo, but unfortunately it's definitely limited and imperfect.
The Bazel rules for languages are also not perfect imo. For example, I dislike hooking Bazel up to tools like NPM and Webpack. I'd rather have a system that could sync NPM modules into third_party automatically and set up Bazel files for them, then have a bundling system native to Bazel that takes full advantage of its caching and pure building.
Bazel is imperfect on Windows as well. I have tried to help, but admittedly it is hard work and it'll take time. I wanted to get Bazel Watcher working on Windows, but my PR is stalled because the Windows API is quite maddening at times. (Feel free to find the PR; it's almost hilarious how convoluted it is to effectively kill a tree of processes. Linux of course is imperfect here, but it lets you get 95% of the way much more easily.)
However, here's what I will say: if you are in an organization, I think Bazel really shines. If you can take time to write some custom tools and rules and really integrate your software into Bazel, it can be an awesome experience. Sadly the publicly available rules try pretty hard to match existing semantics and fall short of showing off how nice Bazel can be in some cases, but I think C and C++ is a great area where Bazel shines above the pack.
Another plus: it is Amazing having a build system that crosses languages. Does your Python script depend on a C module and connect over TCP to a Go program? No problem, all of that is easy to express. Do you want to have a Go script that writes a TypeScript file that gets compiled and bundled into your app's JS bundle? Once again this is all fairly natural, and you can easily accomplish it with a simple combination of normal build rules and a genrule.
And Starlark is a reasonably complete almost-subset of Python, so it's easy to compose, extend and refactor your rules. If you want to generate a matrix of targets for say, testing across browsers and platforms, you can do that, and make it reusable too.
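To make the "matrix of targets" idea concrete, here is a rough sketch written as plain Python rather than real Starlark, so it can be run anywhere. The function name browser_test_matrix and its parameters are made up for illustration, and the dicts stand in for what would be calls to an actual test rule inside a Starlark macro.

```python
def browser_test_matrix(name, srcs, browsers, platforms):
    """Return one target description per (browser, platform) pair.

    Hypothetical helper: in a real Starlark macro, each dict below would
    instead be a call to a test rule. Plain dicts keep the sketch runnable.
    """
    targets = []
    for browser in browsers:
        for platform in platforms:
            targets.append({
                "name": "%s_%s_%s" % (name, browser, platform),
                "srcs": list(srcs),
                "tags": [browser, platform],
            })
    return targets


# Example inputs are assumptions, not taken from any real project.
matrix = browser_test_matrix(
    name="ui_test",
    srcs=["ui_test.ts"],
    browsers=["chrome", "firefox"],
    platforms=["linux", "macos"],
)
for target in matrix:
    print(target["name"])
```

The nested loops are deliberate: they use only constructs that Starlark shares with Python, so the same shape should carry over to a reusable macro.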
Basically my advice with Bazel:
- Check out how well it works with C and C++, and I think Java also works quite well. This should give you an idea of how it looks when done right.
- Don't constrain yourself to what Bazel offers in terms of rules. Starlark is hugely powerful and you can easily make your own rules for things.
P.S.: the weird path syntax is probably many parts legacy, but it's not actually super hard to understand. When you see a colon, the left side of the colon is a path to a folder, and the right side is a target name. When you see double slashes, it means absolute path relative to root of workspace. If the colon is omitted the target name is assumed to be the same as the folder name.
//:base -> the base target in the BUILD file in the root of the workspace
//base -> //base:base -> the base target in the BUILD file in the base folder relative to the root of the workspace
//app/ui:tests -> the tests target in the BUILD file in the app/ui folder relative to the workspace root
:genfile -> the genfile target in the BUILD file in the current directory
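The expansion rules in these examples can be sketched in a few lines of Python. This is only an illustration of the rules as described in this comment, not Bazel's actual parser: expand_label is a hypothetical name, and the @ prefix and the file-versus-target distinctions are deliberately left out.

```python
def expand_label(label, current_package=""):
    """Expand a Bazel-style label into a (package, target) pair.

    Hypothetical helper covering only the forms listed above.
    """
    if label.startswith("//"):
        body = label[2:]
        if ":" in body:
            # Left of the colon is the folder, right of it is the target name.
            package, target = body.split(":", 1)
        else:
            # No colon: the target is named after the last folder in the path.
            package = body
            target = body.rsplit("/", 1)[-1]
        return package, target
    if label.startswith(":"):
        # Relative label: a target in the BUILD file of the current directory.
        return current_package, label[1:]
    raise ValueError("unsupported label form: " + label)


for label in ["//:base", "//base", "//app/ui:tests", ":genfile"]:
    print(label, "->", expand_label(label, current_package="app/ui"))
```

An empty package string stands for the root of the workspace here, which is an arbitrary choice made for the sketch.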
There is some context sensitivity about how to refer to files versus targets and whether you're referring to runfiles, output files, or build files, but most of the time it's surprisingly obvious actually. When it comes to files versus targets, it largely works a bit like Make except there's namespacing for input files vs output files (and runfiles, but that's another topic.)
There is also an @ syntax used to refer to paths outside the current workspace. It mainly comes into play when importing rules.
> However, here's what I will say: if you are in an organization, I think Bazel really shines. If you can take time to write some custom tools and rules and really integrate your software into Bazel, it can be an awesome experience. ... Another plus: it is Amazing having a build system that crosses languages.
This is pretty much what I think of when I want to like Bazel. I wish we had it on Cloud Foundry. Or, rather, I wish it had existed 5 years ago and had been used on Cloud Foundry from the beginning, because CF and its associated projects have hundreds of repositories and these have mostly been kept in sync through mountains of tests and oceans of automation. It works, but I know that in another universe it works better.
I would say it is likely that the lack of a native C++ build tool helped Bazel avoid compromising on how it integrates compilers into the system. I think that C++ is also just a good fit for the design; not all languages are. Interpreted languages fit into the system a bit less well in my opinion (but I still like that they are treated with some level of consistency).
If a mono-repo has a terabyte of code, or if 10 small repos have 1/10th of a terabyte each, what have you really gained? In any case, git LFS solves large file storage effectively, as do a number of other artifact storage solutions, and a repo with a terabyte of code is _not_ going to be trivially split apart, since it would be, by a factor of thousands, the biggest codebase ever created by humankind.
If I only need to check out one of the smaller repos then I've gained quite a lot in terms of download speed, storage size, etc. Git LFS adds a lot of complexity I'd rather avoid.
Sure but then you only have some small portion of the total infrastructure, which adds its own layer of complexity for the people reviewing your changes :P It's all trade offs, is all I'm saying - I honestly still can't decide between the two, although for all companies sub 20 people, I'd for sure stick with a single repo.
If I'm working on Application X, wtf do I care about infrastructure code? Or for that matter, as a specific... if someone is working on Google Maps, should they care about the codebase for Google Inbox for Android?
You may be relying on a shared component for your app: you simply put a deps reference to it in your Bazel (Blaze) BUILD file, e.g. "//base:something". Now that "//base:something" might itself rely on other deps, but that should not be your concern.
So what's stopping you from depending on (using) anything else? Or how do we stop you from doing this? Bazel (Blaze) has visibility rules, which by default are private, i.e. the rules in your packages are hidden unless explicitly made public. Alternatively, you can whitelist one by one which other packages (//java/com/google/blah/myapp) are allowed to depend on you.
Let's say there is a cool new service and your team wants to try it out, but it's not out there for everyone to use yet: it's in alpha, beta, whatever stage. So you ask the owning team for permission, or simply create a CL that adds your package (target name or "..." folder pattern) to the whitelist, and eventually you will be whitelisted (if that's a good idea, and it's approved). Conversely, say some library got deprecated and is slowly being replaced with another, so that instead of being "//visibility:public" it now just whitelists its last remaining users. It's probably not a good idea to get added to that list, as the whole thing is going away soon (yes, Google tends to deprecate internally even faster than externally ... which is good!). But such mechanisms are helpful in getting this to work correctly.
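As a rough model of the visibility check described above, here is a Python sketch. It is a simplification and not how Bazel implements it: is_visible is a hypothetical function, packages are plain strings without the leading //, and only the public, private, __pkg__ and __subpackages__ forms of a visibility entry are handled.

```python
PUBLIC = "//visibility:public"
PRIVATE = "//visibility:private"


def is_visible(target_package, visibility, consumer_package):
    """Return True if consumer_package may depend on a target in target_package.

    Hypothetical helper. An empty visibility list is treated as private,
    matching the private-by-default behaviour described above.
    """
    if consumer_package == target_package:
        return True  # targets in the same package can always see each other
    for spec in visibility:
        if spec == PUBLIC:
            return True
        if spec == PRIVATE:
            continue
        # Split "//some/pkg:__pkg__" into "some/pkg" and "__pkg__".
        package, _, kind = spec[2:].partition(":")
        if kind == "__pkg__" and consumer_package == package:
            return True
        if kind == "__subpackages__" and (
            package == ""
            or consumer_package == package
            or consumer_package.startswith(package + "/")
        ):
            return True
    return False


# Whitelisting a single package, as in the //java/com/google/blah/myapp example.
whitelist = ["//java/com/google/blah/myapp:__pkg__"]
print(is_visible("base", whitelist, "java/com/google/blah/myapp"))
print(is_visible("base", whitelist, "java/com/google/other"))
```

Deprecating a library then amounts to shrinking its visibility list from public down to the remaining users, which is the situation the comment describes.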
Does Application X rely on particular infrastructure configuration? Or does Google Inbox on Android integrate with Google Maps?
There are dependencies everywhere. Monorepos are one of the tools which can be used to make dealing with them easier in some cases. They’re not an absolute solution, nor appropriate for all circumstances, but no tool is!
> If a mono-repo has a terabyte of code, or if 10 small repos have 1/10th a terabyte each, what have you really gained?
If it's a small company where every developer touches every part of the application, sure. Taking the FAANG approach if you're not part of that acronym sounds like introducing inefficiency.
If it's a "small" company then I'd expect that one Git repo would do just fine for all or at least most of the code. When I think small, I think ~10 or 20 developers. If you have reasonable hygiene about things like keeping binaries out of your Git repo (excluding consideration of e.g. LFS here) then the whole repo size will stay fairly reasonable. As long as you have one or two Git mavens on your team it should be dandy.
I'd expect to see problems with this approach once you get into the 100s or 1000s of developers. The tooling for this scale of repository isn't as mature.
Isn't the entire argument about the current (or maybe "immediately foreseeable") state of tooling? We don't really care one way or the other, in a philosophical sense. What works?
When the tools aren't good enough, we can either toss up our hands and say "I guess it's always going to be like this!", or we can get to work and make better tools.
This is an argument about how to use current tools. TFA doesn't argue that mono will be great once we work really hard. It argues that mono is great now. Thread parent has a specific objection to that argument. You don't reasonably counter that objection with statements about morality.
- The article spoke about points that were largely independent of the current or future state of tooling. Instead, it focused on fundamental issues with mono- vs poly-repo systems. Most directly, being forced to fix migrations and incompatibilities immediately rather than letting versions skew.
If you want to batter someone for not arguing for or against the points in the article, you can do it with the comment I was replying to, or with your own comment just now.
In the post yesterday one of the arguments was that if nobody checks out all of the code then what's the value of having the code all in one place?
Last monorepo I worked on, individual contributors checked out just the tree they were working on (we had a suite of applications with several shared modules). We made it simple and straightforward for them to get what they wanted and ignore people whose work didn't impact them.
But the senior people, who were better with architecture and version control trivia, checked out the entire thing. They would steward any cross-cutting changes that needed to be done, and make sure any callers to shared libraries were updated in the face of breaking changes. They were also backstopped by the build plans, (some of) which also checked out the entire thing.
Streams aren't modules -- they're views. If someone takes you as a dependency and wants you to have visibility on them they add themselves to your stream so you pull down their directory as well.
But imagine the increased productivity of your devs if they only had to check out a single repo. Anyone has the same organization of projects on their machine. All tools are in one place...
A: You avoid issues such as Readme files stating, "before compiling you have to git clone ../commonA, ../commonB". These always tend to get stale, so in reality you also have to git clone ../commonC, which costs you hours of troubleshooting.
B: A developer working daily in component A finds a bug in component B. He just has to change the code and commit it for review, instead of having to understand the specifics of working with component B's repository.