Hacker News
by aanm1988 3420 days ago
They have a billion files in their repo, 9 million are source files.

What the heck is the other 991000000?

I skimmed this. Mostly just stuff any competent company would/should be doing. it's google though, so they act like it's super awesome.

9 comments

Someone checked in a `node_modules` folder by accident.
That would pretty much do it
See this ACM article on it: http://m.cacm.acm.org/magazines/2016/7/204032-why-google-sto...

Lots of things aren't source files: test data, config files, build files, metadata, documentation, etc.

> Mostly just stuff any competent company would/should be doing. it's google though, so they act like it's super awesome.

Yes, you're absolutely correct. But here's the thing - it was actually Google that pioneered many of these. Many of the big/competent companies that follow these practices do so because of Google's "DNA" leaking into those companies (via former employees bringing along the best practices learned at Google, etc.)

They may have done a better job instituting these practices across a large organization, and some of their tools have very useful and novel features, but I very much doubt there is a single practice that they actually invented. If you think there is one, please be specific. I think what Google contributed is evidence that these practices can be instituted at scale, which really was sorely lacking in some cases. This helped the industry disseminate them.
Of course it's hard to say if they completely, 100% invented anything from scratch. But they sure did "pioneer" a lot of unique practices that other software companies were not following at the time.

A specific example - the practice of keeping the entire codebase at the company under a single "source" repo. Pre-Google - it would've been considered outrageous for a sophisticated software company to keep its entire codebase under a single repo. But Google did it, and other companies have followed suit successfully (as Google DNA has leaked to other companies).

Yes, of course keeping code in a single repo is not a "new invention". Linux is a single repo; many smaller companies have only a single repo because their only product is a single web app. Google keeps nearly 100% of their entire codebase in a single repo - and that was definitely a novel approach at the time.

Microsoft used to have the best practices and...they were mostly as good as Google. Everything old is new again.
As someone who worked at both companies for a long time, I assure you that Google's best practices (circa when I switched) were a generation ahead of Microsoft's. Mostly due to MSFT having much longer software release cycles, a more primitive, Windows-based internal cloud, many legacy build systems, less inter-group trust, and little company-wide desire to improve things.
> What the heck is the other 991000000?

Says right in the article: various config and dependency files, presumably both as caches (where everyone would generate the same product) and as a record of where things stood at time t.

For example:

> In some cases, notably Go programs, build files can be generated (and updated) automatically, since the dependency information in the BUILD files is (often) an abstraction of the dependency information in the source files. But they are nevertheless checked in to the repository.

So basically somebody can write a script to put this Build file in gitignore, save the company millions of dollars, and get promoted for it?
There are other possible annotations in the build file.

You can get an idea of what it looks like by reading the Bazel docs: https://bazel.build/versions/master/docs/be/overview.html

Storing a few text files at Google doesn't cost millions of dollars, BTW.

They don't use git or any other distributed version control system, so there is no incentive to keep it small. And anything outside the source control system isn't accessible to all the tools that use it, so it would introduce complexity.
Then you have to run a tool that processes .go source files in order to perform dependency analysis. Consistency is a virtue.
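To make the "dependency analysis" point concrete, here is a toy sketch of what such a tool has to do: read the import statements out of .go files and emit a BUILD-style stanza. Everything here is an assumption for illustration (the `go_imports`, `is_external`, and `render_build` names, the regexes, and the "first path element contains a dot" heuristic for spotting non-stdlib imports). It is not Google's tooling, it does not map import paths to real Bazel labels, and a regex will be fooled by import-looking text inside comments or strings.

```python
import re

# Single-line form: import "fmt"  or  import alias "fmt"
_SINGLE = re.compile(r'^\s*import\s+(?:[\w.]+\s+)?"([^"]+)"', re.MULTILINE)
# Parenthesized form: import ( ... )
_BLOCK = re.compile(r'^\s*import\s*\((.*?)\)', re.MULTILINE | re.DOTALL)
_QUOTED = re.compile(r'"([^"]+)"')


def go_imports(source: str) -> list[str]:
    """Return the sorted, de-duplicated import paths found in Go source text."""
    paths = set(_SINGLE.findall(source))
    for block in _BLOCK.findall(source):
        paths.update(_QUOTED.findall(block))
    return sorted(paths)


def is_external(path: str) -> bool:
    """Heuristic (assumption): stdlib paths have no dot in their first element."""
    return "." in path.split("/")[0]


def render_build(name: str, sources: dict[str, str]) -> str:
    """Render a BUILD-like stanza from {filename: source text}.

    deps are left as raw import paths; a real tool would translate
    them into build labels.
    """
    deps = sorted(
        {p for text in sources.values() for p in go_imports(text) if is_external(p)}
    )
    lines = ["go_library(", f'    name = "{name}",', "    srcs = ["]
    lines += [f'        "{filename}",' for filename in sorted(sources)]
    lines += ["    ],", "    deps = ["]
    lines += [f'        "{dep}",' for dep in deps]
    lines += ["    ],", ")"]
    return "\n".join(lines)
```

Even this toy version has to open and scan every source file on every run, which is the cost the checked-in BUILD file avoids.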
Nah, because it would cost the company billions of dollars in lost productivity waiting for these files to get re-processed every time someone built the thing. Google's general philosophy is that humans are expensive and computers are cheap, so pretty much anything that helps the humans go faster is going to be a net benefit in the long run.
People rarely get promoted for saving money.
Finance people do.
Good point. I meant software engineers rarely get promoted for saving money.
This does not apply in this post though.

There are thousands of SWEs working on systems to save money for Google.

They sure do at Amazon. Frugality is one of the explicit leadership principles and initiatives often have cost saving as a primary goal and always as a secondary goal.
It was very eye-opening and helpful for me. Given that at our startup we are just starting to grow and trying to set software development processes and standards to help with the growing number of devs, this info provides good guidance on what to aim for, and also showed me that we are on the right path in several ways.
Well, I just fired up Android Studio and created a blank app. I ended up with no less than 77 files. Seriously, 77 files for a freaking BLANK app.

That "Hello World" Flask program that was 1 nice cute file? It's about 20 files deployed in Heroku.

Sometimes I wonder if things really need to be this complicated.

No, they don't.
> Mostly just stuff any competent company would/should be doing.

Many companies should be doing this. Few (that I know of) are doing this.

Making data-driven decisions also should be a thing, yet many still make them based on nonsense like politics.

Right, because data leaves no room for interpretation.
No need to be needlessly sarcastic. Data-driven means that you collect various metrics on dev workflow, what slows productivity, or on the product side (user patterns, retention, etc.) and use those when making decisions. Unfortunately, many companies still base their decisions on very simplistic metrics and/or on "instinct".
Sorry, it was late and I didn't want to write a more substantive response.

The issue here is that politics are unavoidable. Being more data-driven is just another way of running your political process. And yes, it's a better way as long as you know its limitations. Collecting data and sifting through it to extract useful information takes time, creative thinking, and even "instinct" to figure out the right questions and hypotheses. Furthermore, if you're going to collect data on dev workflow, you'd better not attach employee incentives to it, or the metrics will be gamed.

One of my pet peeves is technical people who worship so strongly at the altar of rationality that they are blind to their own biases. Even the most guileless and logical engineer still has an emotional life and worldview that forms the building blocks of what turns into "politics" when you get a large group of people together.

Agreed! I also hate it when people think their methods are so rational that they represent the ground truth, and are not biased in any way.
> would/should

There are billions of dollars of difference between "would" and "should"

Who is "they" who act like it's super-awesome?

Translation files, xml files, some data files, images, and lots of other things.

Ps: goog employee