Hacker News new | ask | show | jobs
by gvb 3427 days ago
Using git with large repos and large (binary blob) files has been a pain point for quite a while. There have been several attempts to solve the problem, none of which have really taken off. I think all the attempts have been (too) proprietary – without wide support, it doesn’t get adopted.

I'll be watching this to see if Microsoft can break the logjam. By open sourcing the client and protocol, there is potential...

Other attempts:

* https://github.com/blog/1986-announcing-git-large-file-stora...

* https://confluence.atlassian.com/bitbucketserver/git-large-f...

Article on GitHub’s implementation and issues (2015): https://medium.com/@megastep/github-s-large-file-storage-is-...

3 comments

I think Joey Hess' attempt at "solving the problem" deserves a mention.

It is open source (GPLV3) licensed. [not proprietary]

Written in Haskell. [cool aid]

Currently has 1200+ stars on Github and is part of at least Ubuntu (http://packages.ubuntu.com/search?keywords=git-annex) since 12.04. [shows something for support and adoption]

edit: Link to Github https://github.com/joeyh/git-annex -- thanks dgellow

For the problem of large files I think Git LFS has largely won out over git annex, mostly because it's natively supported by GitHub and GitLab and requires no workflow changes to use.
Atlassian's Bitbucket and Microsoft's Visual Studio Team Services both also support Git LFS.
As of version 6 of git annex, the only thing an unlocked repo need other than the usual workflow is a `git annex sync`, that could be easily configured as a push hook.
There's a small but important trap to people who might want to use git-annex as a backup tool, namely that you can't store a git repo in git-annex.

http://git-annex.branchable.com/forum/Storing_git_repos_in_g...

Having nested git repositories is a solved problem both in git and in git-annex: use submodules.

https://git-annex.branchable.com/submodules/

Link to the github mirror https://github.com/joeyh/git-annex
git-annex is, IMHO, by far the best solution.

Pros of git-annex:

- it is conceptually very simple: use symlinks instead of ad-hoc pointer files, virtual files system, etc. to represent symbolic pointer that point to the actual blob file;

- you can add support for any backend storage you want. As long as it support basic CRUD operations, git-annex can have it as a remote;

- you can quickly clone a huge repo by just cloning the metadata of the repo (--no-content in git-annex) and just download the necessary files on-demand;

And many other things that no other attempt even consider having, like client-side encryption, location tracking, etc.

Git Annex is only a partial solution, since it only solves issues with binary blobs. It doesn't solve problems with large repos.
That still only solves half the problem with large binary blobs.

The other half is that almost all of the binary formats can't be merged and so you need a mechanism to lock them to prevent people from wiping out other people's changes. Unfortunately that runs pretty much counter the idea of DCVS.

I always wonder why this never gets discussed much. We seem to have tons of solutions for storing large files outside the repo, but so what? OK, so I don't deny that storage still isn't cheap enough to just say "oh well" to the idea of a multi-TByte repo, so it's certainly solving a problem. But there's still another major problem left!

Don't their artists and designers use version control too? Maybe they just have one such person per team, or each person owns one file, or something like that. Hard to say.

Maybe it's like how I used to work on teams that never used branches - you have various problems that you figure there's probably a solution for, but there's never time to (a) figure out what the solution looks like, (b) shift the whole team over to a brand new workflow and set of tools, and (c) clean up the inevitable mess. So you just work around the problems the same way you always have - because at least that's a known quantity.

Yeah, best solution I've seen(but haven't had a chance to use in anger) is P4-Fusion(P4 backend, git views).

Perforce was always the gold standard for stuff like this. Did a great job at not only providing locking but stuff like seamless proxies and other solutions to common problems in that domain(like a usable UI).

When git-annex finds a conflict it can't solve, it gives back to you the two versions of the same file with the SHA of the original versions suffixed.

This way you can look at both and resolve the conflict.

That's the exact point I'm making, fundamentally a large portion of these formats(PSD, JPG, PNG, MA, 3DS) can't be resolved.

If two people touch the same file at the same time someone is going to drop their work on the floor and that's a bad thing(tm). You need to synchronize their work with a locking mechanism that informs the user at the edit(not sync) point in the workflow.