by kentt 3427 days ago
It's disappointing that all the comments are so negative. This is a great idea and solves a real problem for a lot of use cases.

I remember that years ago Facebook said it had this problem. A lot of the comments centered on the idea that you could change your codebase to fit what git can do. I'm glad there's another option now.


Yes, they did. They chose to scale out Mercurial to solve their problem. I wonder if they still use Mercurial?

https://code.facebook.com/posts/218678814984400/scaling-merc...

Both Facebook and Google are continuing to contribute to Mercurial, so they both have some vested interest in it. If you poke around the commits on the repo[0] you'll see commits from people with @fb.com and @google.com email addresses. The mailing list also still has activity from both companies.

As well, the Mercurial team does quarterly sprints (I believe), and Google is hosting the next one[1].

[0] https://www.mercurial-scm.org/repo/hg

[1] https://www.mercurial-scm.org/wiki/4.2sprint

Sprints are twice a year, once in the US and once in Europe.
They do. Durham Goode (Tech Lead on Source Control at Facebook) just held a talk at Git-Merge about how they scaled Mercurial at Fb. They seem to be quite happy with it, albeit applying quite a few restrictions on their internal users that are not really transferable to the general (outside-corporate) usage of VCS (for example only rebases are allowed, directly committing to master all the time, etc.)
That's actually pretty comparable to how we tend to operate the Mercurial project, FYI. We tend to prefer rebase to merge for feature work.
Do you use Changeset Evolution?
A very small fraction of FB engineers use Changeset Evolution.

We have new workflows based on some of the underpinnings of Evolution, but without the UI confusion.

They did a couple of months ago so I assume they still do.
I also don't think that this is a good idea. Git is a distributed version control system (https://en.wikipedia.org/wiki/Distributed_version_control), the main benefit of which is that it "allows many software developers to work on a given project without requiring them to share a common network". Seems like with GVFS they are making DVC to be a CVS (https://en.wikipedia.org/wiki/Concurrent_Versions_System) again. What is the point? There are a lot of good CVS systems around. Is it just to give cool kids access to cool tools? I believe there are plenty bridges between CVS and git already implemented, which also allows you to checkout only part of the CVS tree.

At Splunk we had the same problem: our source code was stored in a CVS (Perforce), but we wanted to switch to git. That was not only because we really wanted to use git, but also to simplify our development process, mainly through the much easier branching model (lightweight branching is also available in Perforce, but to get it we would have needed to upgrade our servers). We also had the problem that at the beginning our working tree was very large. I don't think it was 200-300 GB, I believe it was 10x less, but it still required 4-5 seconds for git status. This was not acceptable for us, so we reworked our source code and release builds to split them across several git repos and make sure that git status takes no more than 0.x seconds.

My point is: use the right tools for the right jobs. 4-5 seconds for git status is still a huge problem; I would prefer to use CVS instead if that means I don't have to wait 5 seconds for each git status invocation.

> I believe there are plenty bridges between CVS and git already implemented, which also allows you to checkout only part of the CVS tree.

How many of them have you used? I've used a couple, to interact with large code bases on the rough order of 300GB. In my experience they don't work very well, because you have to be hygienic about the commands you run or some part of your Git state gets out of sync with some part of your state for the other source control system. So I gave up on those, and I use something similar to Microsoft's solution at work on a daily basis. It's a real pleasure by comparison, and in spite of that I still call myself a Git fan (about 10 years of heavy Git use now). At work the code base is monolithic and everyone commits directly to trunk (at a ridiculous rate, too).

I've heard horror stories about back when people had to do partial checkouts of the source code, and I'm glad that the tooling I use is better.

The idea of breaking up a repository merely because it is too large reminds me of the story of the drunkard looking for his keys under the streetlights. The right tools for the right job: sometimes you change the job to match the tools, and sometimes you change the tools to match the job.

A bit of a topic hijack, but I'm always curious how "many people committing very often on the same branch" works in practice. I'd expect there to be livelock-like scenarios.

Do you need to do anything special? Or is this just a non-issue? Do you push to master or do you use some sort of pull request GUI (like GitHub or Phabricator)?

Not sure what livelock you're talking about. If you have a bunch of people pushing to Git master, sure, you'll get conflicts and have to rebase (if we're talking about TBD). But the conflicts are always caused by someone successfully pushing to master, so some progress will always be made.
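
A toy model of why this is not a livelock, as a sketch only: `Trunk` and `simulate` are hypothetical names, not any real Git API, and "rebase" is reduced to adopting the new head. A push is rejected only when someone else's push has already moved the head, so every round lands at least one commit.

```python
class Trunk:
    """Toy shared branch: a push is accepted only if based on the current head."""

    def __init__(self):
        self.head = 0  # number of commits accepted so far

    def push(self, base):
        # Reject pushes built on a stale head; the pusher must rebase first.
        if base != self.head:
            return False
        self.head += 1
        return True


def simulate(num_devs):
    """Each developer has one commit; everyone pushes each round until all land."""
    trunk = Trunk()
    bases = {dev: 0 for dev in range(num_devs)}  # head each developer last saw
    rounds = 0
    while bases:
        rounds += 1
        landed = []
        for dev in list(bases):
            if trunk.push(bases[dev]):
                landed.append(dev)
            else:
                bases[dev] = trunk.head  # "rebase" onto the new head
        # A rejection implies another push succeeded, so landed is never empty.
        for dev in landed:
            del bases[dev]
    return rounds, trunk.head


rounds, head = simulate(3)
```

In this model the worst case is one landed commit per round, which is slow under heavy contention but never stuck.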

But I was always just using Git as an interface to something else, usually Perforce or something similar. When pushing with these tools, you'd only get conflicts if other people changed the same files. Git was just used to create a bunch of intermediate commits and failed experiments on your workstation, which is something that it really excels at.

The only real problem is when the file you're changing is modified by many people on different teams, which often means that it's used for operations, and when that becomes a bottleneck it'll get refactored into multiple files or the data will be moved out of source control.

I used one for TFS when I worked at Microsoft, and git p4 when I worked at Splunk. I'm certainly enjoying that we are 100% git now.

My point was that with GVFS they are not really solving the problem they had: git status still takes 4-5 seconds, and to me that is a lot.

So you're saying that GVFS isn't a good idea because it's not good enough for working on the Windows repository?

Well, yeah. It's pre-production, and let Microsoft worry about their own problems anyway. But it sounds like GVFS will be killer for people who have large repos that aren't as large as the Windows repo. Even if 4-5s for the 270GB/3.5M file repo is too long, 400-500ms for the 27GB repo is fantastic.
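
The 400-500ms figure assumes that `git status` time scales roughly linearly with repository size, which nothing in this thread confirms. Under that assumption (the function name is just for illustration) the arithmetic looks like this:

```python
def scaled_status_seconds(measured_seconds, measured_gb, target_gb):
    """Estimate status time for another repo size, assuming linear scaling."""
    return measured_seconds * (target_gb / measured_gb)


# Figures quoted in the thread: 4-5 s on the 270 GB repo, scaled to 27 GB.
low = scaled_status_seconds(4.0, 270, 27)
high = scaled_status_seconds(5.0, 270, 27)
print(f"{low * 1000:.0f}-{high * 1000:.0f} ms")
```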

At some point you ask yourself, "Would I split this repo if my tools could handle the combined repo just fine?" If the answer is no, then you're going to be happy that the tools are getting better at handling big repos. Microsoft's choice to exploit the filesystem-as-API and funnel all filesystem interaction through the VCS is a smart choice and there are a ton of opportunities for optimization that don't exist when you're just writing to a plain filesystem.

> Seems like with GVFS they are making DVC to be a CVS again. What is the point?

It sounds like they answered that:

> In a repo that is this large, no developer builds the entire source tree. Instead, they typically download the build outputs from the most recent official build, and only build a small portion of the sources related to the area they are modifying. Therefore, even though there are over 3 million files in the repo, a typical developer will only need to download and use about 50-100K of those files.

Source will still be distributed among the developers that touch it. Seems like a decent compromise.

I'm curious to dig a bit further in, but from the blog post I get the impression that they are also still cloning the full commit history, just not the full file trees attached to the commits and definitely not the full worktree of HEAD, leaving those to be lazily fetched. If that is the case, that sounds like an interesting compromise on the git model and something verging on some of the speculative ideas I've seen about using something like IPFS to back git's trees, to the point where maybe you could use something like IPFS in tango with this and have a good DVCS solution.
Based on the protocol https://github.com/Microsoft/gvfs/blob/master/Protocol.md#ge... I don't think that "still cloning the full commit history" is true.
> Seems like with GVFS they are making DVC to be a CVS

> just to give cool kids access to cool tools

Yes. DVCS with huge code bases, large binary objects and large teams is hardly the optimal approach. But the "cool kids" are just used to using what they use. And now they can pretend to do it even when they have to be always connected, because the files are virtual and remain on the server until really used.

If Microsoft is giving the solution to the "cool kids," no reason to complain about the fact that Microsoft is willing to care for them.

And if you'd ask the "cool kids" why do they need git at all for such scenarios, have fun with the amount of arguments you'll get. Why this one "needs" vi and another "Emacs" etc. The same reasons. You'll find the arguments also in the comments here. Including mentions of Mercurial, the competition, just like "vi or Emacs". Because. Don't ask.

And no, as far as I understand, Google doesn't primarily "use Mercurial", they use something called Piper, and before they used a customized Perforce just like Microsoft did.

https://www.wired.com/2015/09/google-2-billion-lines-codeand...

"Piper spans about 85 terabytes of data" "and Google’s 25,000 engineers make about 45,000 commits (changes) to the repository each day. That’s some serious activity. While the Linux open source operating spans 15 million lines of code across 40,000 software files, Google engineers modify 15 million lines of code across 250,000 files each week."

It is not clear from the announcement nor the code, but in principle, I don't see a reason that it can't be a DVCS.

Sure, GVFS downloads files only when first read; but maybe it keeps them cached? Maybe you can still work on them and commit changes after you get offline? At least in principle, nothing prevents that.
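
As a rough illustration of that idea (not GVFS's actual implementation, which this thread does not describe), here is a sketch of a lazily populated working tree: a file is fetched on first read, kept locally, and can then be read and edited without the server. `LazyWorkTree` and the `fetch` callable are hypothetical names.

```python
class LazyWorkTree:
    """Toy working tree: contents are fetched on first read, then kept locally."""

    def __init__(self, fetch):
        self._fetch = fetch   # hypothetical callable: path -> bytes (the "server")
        self._local = {}      # files already materialized on this machine
        self.fetch_count = 0

    def read(self, path):
        if path not in self._local:
            # First access: go to the server once and keep the result.
            self._local[path] = self._fetch(path)
            self.fetch_count += 1
        return self._local[path]

    def write(self, path, data):
        # Local edits never contact the server in this model.
        self._local[path] = data


server = {"kernel/sched.c": b"v1", "docs/readme.txt": b"hello"}
tree = LazyWorkTree(server.__getitem__)
tree.read("kernel/sched.c")
tree.read("kernel/sched.c")          # served locally, no second fetch
tree.write("kernel/sched.c", b"v2")  # edit without contacting the server
```

Files that were never read (here `docs/readme.txt`) are never downloaded, which is the part that would stop working offline.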

I was actually surprised that there was only as much negative sentiment as there is. Microsoft could cure cancer and the post to HN would be mostly negative. It's tribal. It doesn't even matter what they do at this point.

That being said, you can see more and more people getting off the "Microsoft is evil" train. It's super slow and every boneheaded thing that Microsoft does resets the needle for lots of people.

I've always been surprised how much sympathy a company like IBM or Intel gets on HN. They both sue people over patents. They both contribute to non-free software. They were early backers of Linux, though, and that is what people care about superficially.

To be honest, I was pretty neutral about MS, for a long time now, carefully optimistic even: IE8 was fair enough (when it was new), Win8 was kinda okay, Azure is great...and just when you think they're a normal company, they take out the old guns and start shoving (first GWX and then) WinX down people's throats, never mind any consent.

So, I'm very, very, very sorry that I can't hear their words over the noise of their actions; and in the light of this, I eye each new gift-bearing Redmondian with suspicion.

to be fair, i can't say that i care if people are fair to a multinational corporation. whether linux fans are right or not they are still only doing what's best for their bottom line. should a company get a trophy for doing what its customers want?
I don't agree with the sentiment that doing something that benefits lots of people should be dismissed on the grounds that it was mutually beneficial.
that's a good point. it's funny, though, that they have actually started doing a lot of things for PR purposes that i can only imagine most of their customers couldn't really care less about.

for example, the majority of their money still comes from windows and office, but open source and hologram BS impress the most vocal anti-MS voices in the media.

my point, though, is that there are other companies that don't draw nearly as much ire that engage in the exact same practices. i think that early antagonism between MS and Linux users has become a tribal signifier for some people. Microsoft people used to have the same kind of relationship with IBM. They also kept flogging that longer than it really made sense...just like linux and mac fans.

HN is at times astonishingly driven by brogrammer conventional wisdom. Look at all of the "why I'm ditching the Mac because I totally need a laptop with 64GB of RAM" stories that got posted after the latest MacBook Pro got introduced. Among the "creatives" in New York there's the phenomenon of "why I left NYC and moved to LA" stories that some people, specifically dumb people, think are somehow representative of the zeitgeist.
yeah, microsoft fans used to think that IBM was literally the devil. turns out they were just another inept global company schlepping its way through history. "microsoft fans" isn't something that you hear that much anymore.
This is just more of their embrace, extend, extinguish campaign. This is the extend part.
yep. you got it in one, wile e. coyote.
> This is a great idea and solves a real problem for a lot of use cases.

I don't know if "a lot" is the right qualifier. Solitary repos of millions of files have scalability problems even outside the source control system (I mean: how long does it take your workstation to build that 3.5 million-file Windows tree?)

A full Android system tree is roughly the same size and works fine with git via a small layer of indirection (the repo tool) to pull from multiple repositories. A complete Linux distro is much larger still, and likewise didn't need to rework its tooling beyond putting a small layer of indirection between the upstream repository and the build system.

Honestly I'd say this GVFS gadget (which I'll fully admit is pretty cool) exists because Microsoft misapplied their source control regime.

It's because the 'problem' it solves is a corner case that's rarely encountered. I love their absurd examples of repos that take 12 hours to download. How many people have that problem, really?

All they did is create a caching layer.

> How many people have that problem, really?

An easy lower bound is tens of thousands of engineers: developers at several large tech companies (e.g. MS, Facebook, Google).
If you deal with code, the case is marginal for you.

If you deal with graphics, audio assets, etc, the binary-blob type of data, the case is central.

This is about code, and code history. Just insane volumes.
Well, it's a problem for thousands of employees of Microsoft, isn't it? We've had a much smaller repository (10GB IIRC) and it really was annoying how long everything took, even with various caches and whatnot enabled.
"I don't have this problem, so nobody does."

Lacking support for large binary blobs is, like, THE #1 reason that an engineer might have to use an alternative.

Ok, but you'll encounter similar git limitations with repos several orders of magnitude smaller than that too.

All you need is several hundred engineers and your monorepo becomes unwieldy for git to handle.

It's not a caching layer, it's lazy evaluation.