Hacker News
by outcoldman 3427 days ago
I also don't think that this is a good idea. Git is a distributed version control system (https://en.wikipedia.org/wiki/Distributed_version_control), the main benefit of which is that it "allows many software developers to work on a given project without requiring them to share a common network". Seems like with GVFS they are making a DVCS into a CVS (https://en.wikipedia.org/wiki/Concurrent_Versions_System) again. What is the point? There are a lot of good CVS systems around. Is it just to give cool kids access to cool tools? I believe there are plenty of bridges between CVS and git already implemented, which also allow you to check out only part of the CVS tree.

At Splunk we had the same problem: our source code was stored in a CVS (Perforce), but we wanted to switch to git. And not only because we really wanted to use git, but to simplify our development process, mainly because of the much easier branching model (lightweight branching is also available in Perforce, but to get it we still needed to do some upgrades on our servers). We also had the problem that at the beginning we had a very large working tree. I don't think it was 200-300 GB, I believe it was 10x smaller, but it still required 4-5 seconds for git status. This was not acceptable for us, so we reworked our source code and release builds to split them into several git repos, to make sure that git status takes no more than 0.x seconds.

My point is: use the right tools for the right jobs. 4-5 seconds for git status is still a huge problem. I would prefer to use CVS instead if that does not require me to wait 5 seconds for each git status invocation.

4 comments

> I believe there are plenty of bridges between CVS and git already implemented, which also allow you to check out only part of the CVS tree.

How many of them have you used? I've used a couple, to interact with large code bases on the rough order of 300GB. In my experience they don't work very well, because you have to be hygienic about the commands you run or some part of your Git state gets out of sync with some part of your state for the other source control system. So I gave up on those, and I use something similar to Microsoft's solution at work on a daily basis. It's a real pleasure by comparison, and in spite of that I still call myself a Git fan (about 10 years of heavy Git use now). At work the code base is monolithic and everyone commits directly to trunk (at a ridiculous rate, too).

I've heard horror stories about back when people had to do partial checkouts of the source code, and I'm glad that the tooling I use is better.

The idea of breaking up a repository merely because it is too large reminds me of the story of the drunkard looking for his keys under the streetlights. The right tools for the right job: sometimes you change the job to match the tools, and sometimes you change the tools to match the job.

A bit of a topic hijack, but I'm always curious how "many people committing very often on the same branch" works in practice. I'd expect there to be livelock-like scenarios.

Do you need to do anything special? Or is this just a non-issue? Do you push to master or do you use some sort of pull request GUI (like GitHub or Phabricator)?

Not sure what livelock you're talking about. If you have a bunch of people pushing to Git master, sure, you'll get conflicts and have to rebase (if we're talking about TBD, i.e. trunk-based development). But the conflicts are always caused by someone successfully pushing to master, so some progress will always be made.
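A toy sketch of the "some progress will always be made" argument, assuming a push behaves like a compare-and-swap on the branch head. The names `Remote`, `push`, `committer` and `simulate` are made up for illustration and are not Git APIs. A push is rejected only when another push landed first, so every rejection corresponds to someone else's commit getting in.

```python
import threading


class Remote:
    """Hypothetical stand-in for a shared master branch; not a real Git API."""

    def __init__(self):
        self._lock = threading.Lock()
        self.head = 0          # number of commits on master
        self.rejections = 0    # pushes refused because head had moved

    def push(self, based_on):
        """Accept the push only if it was built on the current head."""
        with self._lock:
            if based_on != self.head:
                self.rejections += 1
                return False
            self.head += 1
            return True


def committer(remote, commits):
    """Land `commits` commits, retrying whenever a push is rejected."""
    for _ in range(commits):
        while True:
            base = remote.head        # "fetch" the current head
            if remote.push(base):     # "push" on top of it
                break
            # Rejected: someone else pushed first, so "rebase" and retry.


def simulate(developers=8, commits_each=50):
    remote = Remote()
    threads = [
        threading.Thread(target=committer, args=(remote, commits_each))
        for _ in range(developers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return remote


remote = simulate()
print(remote.head, "commits landed,", remote.rejections, "pushes had to retry")
```

The number of retries varies from run to run, but the head only ever moves forward, which is why this looks like contention rather than livelock.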

But I was always just using Git as an interface to something else, usually Perforce or something similar. When pushing with these tools, you'd only get conflicts if other people changed the same files. Git was just used to create a bunch of intermediate commits and failed experiments on your workstation, which is something that it really excels at.
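A minimal sketch of the file-level rule described above, assuming a submit conflicts only when it touches a file that someone else changed since your base revision. The function name and the example paths are hypothetical, and this is not how Perforce is actually implemented.

```python
def conflicting_files(my_files, upstream_commits):
    """Return the files I changed that also changed upstream since my base."""
    touched_upstream = set()
    for commit_files in upstream_commits:
        touched_upstream.update(commit_files)
    return sorted(set(my_files) & touched_upstream)


# Two commits landed upstream while I was working (made-up paths).
upstream = [["net/http.c", "net/tls.c"], ["ui/menu.c"]]

print(conflicting_files(["fs/cache.c"], upstream))               # no overlap
print(conflicting_files(["ui/menu.c", "fs/cache.c"], upstream))  # overlap on one file
```

Under this rule, unrelated commits never block each other, no matter how fast trunk is moving.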

The only real problem is when the file you're changing is modified by many people on different teams, which often means that it's used for operations, and when that becomes a bottleneck it'll get refactored into multiple files or the data will be moved out of source control.

I used one for TFS when I worked at Microsoft, and git p4 when I worked at Splunk. I'm certainly enjoying that we are 100% git now.

My point was that with GVFS they are not really solving the problem they had: git status still takes 4-5 seconds, and to me that is a lot.

So you're saying that GVFS isn't a good idea because it's not good enough for working on the Windows repository?

Well, yeah. It's pre-production, and let Microsoft worry about their own problems anyway. But it sounds like GVFS will be killer for people who have large repos that aren't as large as the Windows repo. Even if 4-5s for the 270GB/3.5M file repo is too long, 400-500ms for the 27GB repo is fantastic.
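The 400-500ms figure is just the quoted 4-5s scaled down by 10x. A back-of-the-envelope sketch, assuming git status time grows linearly with repo size (an assumption for illustration only; the real cost likely tracks the number of files more than the number of bytes). The function name and defaults are made up for this example.

```python
def estimated_status_seconds(repo_gb, ref_gb=270.0, ref_seconds=(4.0, 5.0)):
    """Scale the quoted 4-5 s at 270 GB linearly to another repo size."""
    scale = repo_gb / ref_gb
    return tuple(seconds * scale for seconds in ref_seconds)


low, high = estimated_status_seconds(27.0)
print(f"{low * 1000:.0f}-{high * 1000:.0f} ms")
```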

At some point you ask yourself, "Would I split this repo if my tools could handle the combined repo just fine?" If the answer is no, then you're going to be happy that the tools are getting better at handling big repos. Microsoft's choice to exploit the filesystem-as-API and funnel all filesystem interaction through the VCS is a smart choice and there are a ton of opportunities for optimization that don't exist when you're just writing to a plain filesystem.

> Seems like with GVFS they are making a DVCS into a CVS again. What is the point?

It sounds like they answered that:

> In a repo that is this large, no developer builds the entire source tree. Instead, they typically download the build outputs from the most recent official build, and only build a small portion of the sources related to the area they are modifying. Therefore, even though there are over 3 million files in the repo, a typical developer will only need to download and use about 50-100K of those files.

Source will still be distributed among the developers that touch it. Seems like a decent compromise.

I'm curious to dig a bit further in, but from the blog post I get the impression that they are also still cloning the full commit history, just not the full file trees attached to the commits and definitely not the full worktree of HEAD, leaving those to be lazily fetched. If that is the case, that sounds like an interesting compromise on the git model and something verging on some of the speculative ideas I've seen about using something like IPFS to back git's trees, to the point where maybe you could use something like IPFS in tandem with this and have a good DVCS solution.

Based on the protocol https://github.com/Microsoft/gvfs/blob/master/Protocol.md#ge... I don't think that "still cloning the full commit history" is true.

> Seems like with GVFS they are making a DVCS into a CVS

> just to give cool kids access to cool tools

Yes. A DVCS with huge code bases, large binary objects and large teams is hardly the optimal approach. But the "cool kids" are just used to using what they use. And now they can pretend to do it even when they have to be always connected, because the files are virtual and remain on the server until really used.

If Microsoft is giving the solution to the "cool kids," no reason to complain about the fact that Microsoft is willing to care for them.

And if you ask the "cool kids" why they need git at all for such scenarios, have fun with the number of arguments you'll get. Why this one "needs" vi and another "Emacs" etc. The same reasons. You'll find the arguments also in the comments here. Including mentions of Mercurial, the competition, just like "vi or Emacs". Because. Don't ask.

And no, as far as I understand, Google doesn't primarily "use Mercurial". They use something called Piper, and before that they used a customized Perforce, just like Microsoft did.

https://www.wired.com/2015/09/google-2-billion-lines-codeand...

"Piper spans about 85 terabytes of data" "and Google’s 25,000 engineers make about 45,000 commits (changes) to the repository each day. That’s some serious activity. While the Linux open source operating system spans 15 million lines of code across 40,000 software files, Google engineers modify 15 million lines of code across 250,000 files each week."

It is not clear from either the announcement or the code, but in principle I don't see a reason why it can't be a DVCS.

Sure, GVFS downloads files only when first read; but maybe it keeps them cached? Maybe you can still work on them and commit changes after you get offline? At least in principle, nothing prevents that.
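To make the "in principle" concrete, here is a toy sketch of a lazily populated file store that keeps whatever it has fetched. This is purely an assumption about how such a scheme could work, not GVFS's actual design. `LazyFileStore`, `fake_server_fetch` and the paths are hypothetical names.

```python
class LazyFileStore:
    """Hypothetical lazily populated working tree; not GVFS's actual design."""

    def __init__(self, fetch):
        self._fetch = fetch   # callable: path -> bytes, needs the network
        self._cache = {}      # path -> bytes already materialised locally
        self.online = True

    def read(self, path):
        if path not in self._cache:
            if not self.online:
                raise FileNotFoundError(f"{path} was never fetched and we are offline")
            self._cache[path] = self._fetch(path)
        return self._cache[path]

    def write(self, path, data):
        # Local edits only touch the cache, so they also work offline.
        self._cache[path] = data


fetch_log = []


def fake_server_fetch(path):
    """Stand-in for a network download; records every request it serves."""
    fetch_log.append(path)
    return f"contents of {path}".encode()


store = LazyFileStore(fake_server_fetch)
store.read("kernel/sched.c")               # first read goes to the "server"
store.read("kernel/sched.c")               # second read is served from the cache
store.online = False
store.write("kernel/sched.c", b"patched")  # editing a cached file works offline
```

In this model, anything you have already touched stays usable offline, and only files you have never read require a connection.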