by breck 1870 days ago
My practice for storing large files with Git is to put the metadata for each large file in one or more tiny files:

1. Type information. Enough to synthesize a fake example.

2. A simple preview. This can be a thumbnail or a video snippet, for example.

3. Checksum and URL of the big file.

This way your code can work at compile/test time using the snippet or synthesized data, and you can fetch the actual big data at ship time.

You can then also use the best version control tool for the job for the particular big files in question.
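To make the three pieces above concrete, here is a minimal sketch of such a tiny metadata file and a loader that uses it. The field names (`type`, `preview`, `sha256`, `url`), the example URL, and the `load_rows` helper are all assumptions made up for illustration, not an established format.

```python
import hashlib
from pathlib import Path

# Hypothetical stub that would be committed to Git in place of the big file.
# The field names are illustrative assumptions, not a standard.
EXAMPLE_STUB = {
    "type": {"format": "csv", "columns": ["id", "score"]},
    "preview": "id,score\n1,0.5\n2,0.9\n",
    "sha256": hashlib.sha256(b"id,score\n1,0.5\n2,0.9\n3,0.1\n").hexdigest(),
    "url": "https://example.com/data/scores.csv",
}


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def load_rows(stub, big_file=None):
    """Return data rows from the real file if it is present, else from the preview.

    The real file is only trusted when its checksum matches the stub.
    """
    if big_file is not None and Path(big_file).exists():
        if sha256_of(big_file) != stub["sha256"]:
            raise ValueError("checksum mismatch for " + str(big_file))
        text = Path(big_file).read_text()
    else:
        text = stub["preview"]
    # Skip the header line and split each remaining line into fields.
    return [line.split(",") for line in text.splitlines()[1:]]
```

Tests can call `load_rows(EXAMPLE_STUB)` with no big file present and still exercise the parsing code; the same call with a path to the downloaded file runs against the full data.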

2 comments

Is this just a manual equivalent of git LFS, or is there some advantage here?
This is superior to Git LFS in several respects:

- You have file type and preview that you can use without getting the full thing

- You have custom metadata for each file, enforced by your scripts -- for example, for archives you may store the list of files inside. This allows your CI tests to validate references into the archives without having to download the whole huge thing.

- You fully control remote fetch logic. Multiple servers? Migration rules for old revisions? That weird auth scheme that your IT insists on? It is all supported with a bit of code.

- You fully control local storage. Do you want a computer-wide shared CAS cache between multiple users? What if you have NAS that most users mount? Or maybe s3fs is your thing? Adding support is easy.

The main downside is that you have to do all the tooling and documentation yourself, so I would not recommend this for smaller teams. Nor would I recommend it for open-source projects.

But if your infra team is big enough to support this, you'll likely have a better experience than with generic Git LFS.
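As a rough sketch of the "bit of code" behind the fetch and local-storage points above: a fetcher that checks a shared content-addressed cache first, then tries several sources in order and only accepts data whose hash matches. The `fetch` function, the idea of passing sources as callables, and the cache layout are all assumptions for illustration.

```python
import hashlib
import shutil
from pathlib import Path


def fetch(sha256, sources, cache_dir, dest):
    """Copy the blob with the given SHA-256 to dest, using a shared cache.

    sources is a list of callables (hypothetical); each takes a target Path,
    writes the blob there, and raises OSError if it cannot.
    """
    # Content-addressed layout: <cache>/<first two hex chars>/<full hash>.
    cached = Path(cache_dir) / sha256[:2] / sha256
    if not cached.exists():
        cached.parent.mkdir(parents=True, exist_ok=True)
        tmp = cached.with_suffix(".tmp")
        for source in sources:
            try:
                source(tmp)
            except OSError:
                tmp.unlink(missing_ok=True)  # drop any partial download
                continue
            # A real tool would hash in chunks; reading whole is fine for a sketch.
            if hashlib.sha256(tmp.read_bytes()).hexdigest() == sha256:
                tmp.replace(cached)  # move into the cache only once verified
                break
            tmp.unlink()  # wrong content, try the next source
        else:
            raise FileNotFoundError("no source provided blob " + sha256)
    shutil.copyfile(cached, dest)
    return Path(dest)
```

Each source callable is where the custom logic would live: a mirror list, a migration rule for old revisions, an in-house auth scheme, or a NAS mount. Pointing `cache_dir` at a shared directory gives several users one cache.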

It's a design pattern that ensures testability of the system without any dependencies on the big files.
Tools like git-annex or dvc support similar strategies.
Is git-annex still alive? Last time I tried to use it, it was very rough, and the official wiki (which serves as docs and bug tracker) gave database errors when I tried to create an account.

Details: I wanted a remote that I can push to but that anonymous users can only pull from, and I couldn't piece it together.

I've always liked the simplicity of git-fat [0]:

* Initial setup includes git filter rules so that "git add" automatically uses git-fat for matching files (no need to remember to invoke git-fat when adding/changing files).

* It works by rsync'ing to/from the remote. The setup for this is in a single ".gitfat" file, separate from the filter rules.

* You do need to run "git fat push" and "git fat pull"; this can probably be automated with hooks.

So, offhand, without even trying to think about the "right" way to do what you want: the committed ".gitfat" could point to a read-only remote, and you could swap it with your own uncommitted file, pointing to an rsync-writable remote, when you push.

Also, the whole thing is a single 628-line python file, so worst case it would be easy to tweak it to read something like ".gitfat-push" and not have to manually swap it.
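A sketch of what that tweak might look like, written as a standalone function rather than a patch to git-fat itself. It assumes ".gitfat" is an INI-style file with an `[rsync]` section and a `remote` key, as shown in the git-fat README; the ".gitfat-push" file name and the `rsync_remote` helper are hypothetical.

```python
import configparser
from pathlib import Path


def rsync_remote(repo_root, pushing):
    """Pick the rsync remote: prefer an uncommitted .gitfat-push when pushing.

    Hypothetical helper, not part of git-fat. Falls back to the committed
    .gitfat (the read-only remote) for pulls or when no override exists.
    """
    root = Path(repo_root)
    candidates = [root / ".gitfat"]
    if pushing:
        candidates.insert(0, root / ".gitfat-push")
    for path in candidates:
        if path.exists():
            parser = configparser.ConfigParser()
            parser.read(path)
            return parser.get("rsync", "remote")
    raise FileNotFoundError("no .gitfat configuration found in " + str(root))
```

With ".gitfat-push" listed in ".gitignore", anonymous users would only ever see the read-only remote, while maintainers keep the writable one locally.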

[0] https://github.com/jedbrown/git-fat

Thanks, I didn't know about this one! It seems to only support rsync though, so using it for public repositories would be difficult.