by cesarb 1870 days ago
With git-annex you have the same "one-way door" behavior: it replaces large files with a pointer to the content (in git-annex, a relative symbolic link which by default encodes the real file's size and hash), and the content itself is stored in git-annex's own database.

Sort of. The way that the author talks about Mercurial as not having this problem makes me think they're talking about something related but subtly different. In particular, AFAICT, Mercurial requires the exact same thing as what you're pointing out. If you want to completely disable use of largefiles then you still have to run `hg lfconvert` at some point. That also changes your revision history.

The "one-way door" the article describes, as I understand it, is the additional layer of centralization that Git LFS brings. In particular, it's pretty annoying to always have to spin up a full HTTPS server just to get access to your files. There is now always a single source of truth that is inconvenient to work around, even when you still have the files lying around on a bunch of different hard drives or USB drives.

With git-annex, it is true that unless you rewrite history you'll still have symlinks in your git history, even if you stop using git-annex going forward. However, as long as you still have your exact binary files sitting around somewhere, you can always import them back on the fly. So, for example, to move away from git-annex you can just commit the binary files directly to your git repository. Whenever you go back to an old commit, you copy them out to a separate folder and re-import them.
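
To make the "copy them out and re-import" idea concrete, here is a rough Python sketch. It is not anything git-annex ships. It assumes the default SHA256 backend, where a symlink target ends in `SHA256-s<size>--<hex digest>`; the function names and the folder layout are made up for illustration. It hashes every file in a folder of loose binaries and reports which annex-style symlinks in a checkout those files would satisfy.

```python
import hashlib
import os
import re

# Assumed key shape for the SHA256 backend: SHA256-s<size>--<64 hex chars>.
# Other backends (for example ones that append a file extension) are not handled.
KEY_RE = re.compile(r"SHA256-s(\d+)--([0-9a-f]{64})$")


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files are never loaded whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def index_loose_files(folder):
    """Map hex digest -> path for every regular file under `folder`."""
    index = {}
    for root, _dirs, names in os.walk(folder):
        for name in names:
            path = os.path.join(root, name)
            if os.path.isfile(path) and not os.path.islink(path):
                index[sha256_of(path)] = path
    return index


def match_pointers(worktree, index):
    """Yield (symlink, matching loose file or None) for annex-style symlinks."""
    for root, dirs, names in os.walk(worktree):
        dirs[:] = [d for d in dirs if d != ".git"]  # skip git's own data
        for name in names:
            link = os.path.join(root, name)
            if not os.path.islink(link):
                continue
            found = KEY_RE.search(os.readlink(link))
            if found:
                yield link, index.get(found.group(2))
```

A `None` in the second position means no loose file with that hash was found, so that pointer could not be restored from the folder you scanned.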

But perhaps I'm interpreting the author incorrectly, in which case it's hard for me to see how any solution for large files in git would allow you to move back to an ordinary git repository, one without large file support, without rewriting history.

> so e.g. to move away from git-annex you can just commit the binary files directly to your git directory and then just copy them out to a separate folder whenever you go back to an old commit and re-import them.

Exactly. Here's an (anonymized) example of a git-annex symlink from one of my repos:

    ../../.git/annex/objects/AA/BB/SHA256-s123456--abcdf...1234/SHA256-s8968192--abcdf...1234

It's just a link to a file with a SHA256 hash in the name and path. The simplest way to reconstruct that in the future is to check the whole `objects` directory into the repo, and copy or symlink it back to `.git/annex` when needed. You definitely don't need the git-annex software itself to view the data in the future.
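
As an illustration of how little is needed to read such a link without git-annex, here is a minimal Python sketch that splits the last path component into its parts. It assumes the `BACKEND-s<size>--<hash>` shape visible in the example above; `AnnexKey`, `parse_key`, and `size_matches` are hypothetical names of mine, not part of git-annex.

```python
import os
from typing import NamedTuple


class AnnexKey(NamedTuple):
    backend: str  # e.g. "SHA256"
    size: int     # file size in bytes, taken from the "-s<size>" field
    digest: str   # everything after the "--" separator


def parse_key(link_target):
    """Split the last path component of a symlink target into its fields.

    Assumes the BACKEND-s<size>--<hash> shape; anything else raises ValueError.
    """
    name = link_target.rstrip("/").split("/")[-1]
    head, sep, digest = name.partition("--")
    backend, dash, size_field = head.partition("-s")
    if not sep or not dash or not size_field.isdigit() or not digest:
        raise ValueError("not a recognised key: %r" % name)
    return AnnexKey(backend, int(size_field), digest)


def size_matches(path, key):
    """Cheap pre-check: compare the file size before bothering to hash."""
    return os.path.getsize(path) == key.size
```

The size field makes a quick first filter: a candidate file whose size differs cannot be the one the link points to, so only same-sized files need hashing.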

I personally have hundreds of gigabytes of data in git-annex repos. It works great!

I don’t think it’s clear, but Mercurial has two solutions for large file support: the original “largefiles”, which has all the same design choices and issues the blog post raises about Git LFS, and “lfs”, which is newer.

I’ve used largefiles, ran into these issues, and ended up having to turn it off after a few years. It causes too many problems with tooling because, like Git LFS, it modifies the underlying Mercurial commit structure.

However, it sounds like Mercurial lfs is different in that it only modifies the transport layer, though I’m not totally clear on the details and have been meaning to look into it further.

To preface: though I've read a fair amount about Mercurial, I can count on my fingers the number of times I've actually used a Mercurial repo, and I've only ever used largefiles as a toy, so I am very much a Mercurial newbie. There is a chance I may get something wrong here.

However, my impression is that largefiles is in fact basically the only game in town, and that Mercurial LFS, if anything, is meant to be even more like Git LFS, to the point of being compatible with it.

What I'm more curious about is that I don't immediately see how large file support in git (or Mercurial), whether implemented as a separate tool or natively, could ever feasibly be "transparently erasable": that is, able to be rewound to a repository absolutely identical to one with no large file support, without rewriting revision history.

It doesn't seem impossible (e.g. maybe you could somehow maintain a duplicate shadow revision history and transparently intercept syscalls?), but the approaches I can think of all have pretty hefty downsides and feel even more like hacks than the current crop of tools.

That content can easily be moved in bulk, though. It is true that you have to use git-annex commands to do so, but this is different from LFS, where the complete set of historical files is only stored on the server and can't be moved at all.

edit: The article claims it's a "one-way door" because you can't move to an altogether different system without rewriting history, which is true of git-annex. My bad.