Hacker News
by korijn 1139 days ago
I use Git LFS. There are filter options for all commands, so you don't need to check out any more data than you want or need. Works like a charm for me!

I'd be curious to hear what features you are missing. We have repositories that would be as big as 100GB if you downloaded all large files for the full history, but I guess I don't see why you would want to do that?

2 comments

What repo size do you work with on your machine after filtering, and how many GB do you have in Git LFS, that is, in the cloud? I hear people complain about costs, but that depends on scale and change frequency, which can increase total repo size.
We wrote our own LFS API server (which is actually not very hard, about 100 lines of Python were enough and it performs at scale) so we can directly leverage Azure blob storage. If you don't walk this path and instead enable LFS in GitHub or something like that, the costs are obscene, yes. For us it's dirt cheap.

If I check out head of my repo and don't filter anything, it's a couple GBs.

Inside the Azure blob storage container that backs our LFS API server, there's probably terabytes of data. It's really a lot.

We don't have any performance problems. One API instance can handle it. Of course we did make sure to implement it well... It's Uvicorn/Starlette, all IO is async and all CPU "intensive" work like JSON (de)serialization runs in a background threadpool.

That is really interesting and raises the question of how frequently you have changes in your data that lead to new commits. I am assuming here that you don't dedupe anything, that is, you throw the entire files into Azure with each version, since it's cheap enough for your purposes. Also, how frequently do you move head, even without committing anything new, perhaps to use another branch?
LFS stores files by content hash, so deduplication happens that way. But you're right that if you frequently make small changes to a single large file, it's wasteful.
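
To make the content-hash point concrete, here is a minimal Python sketch. The pointer text follows the Git LFS v1 pointer format; `ContentStore` is a hypothetical in-memory stand-in for a blob container, not anything from our actual server.

```python
import hashlib


def lfs_pointer(data: bytes) -> str:
    """Build the text of a Git LFS v1 pointer file for the given content."""
    oid = hashlib.sha256(data).hexdigest()
    return (
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{oid}\n"
        f"size {len(data)}\n"
    )


class ContentStore:
    """Hypothetical in-memory stand-in for a blob container keyed by oid."""

    def __init__(self) -> None:
        self.blobs: dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        oid = hashlib.sha256(data).hexdigest()
        # Identical content hashes to the same oid, so it is stored only once.
        self.blobs.setdefault(oid, data)
        return oid


store = ContentStore()
first = store.put(b"x" * 1000)
again = store.put(b"x" * 1000)        # same bytes: no new blob
edited = store.put(b"x" * 999 + b"y")  # one byte differs: a whole new blob
```

The last line is the wasteful case: a tiny edit to a large file gets a new oid and is stored again in full.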

In our case though we don't frequently change files, we just get lots and lots of new big files coming in all the time.

Moving head, as in, to check out another branch locally? Somewhat regularly I guess. I suppose you're wondering about performance in that scenario? It's usually quite good since git-lfs does some local caching as well. I've never needed to wait longer than a couple of seconds. I'm usually on a wired 1000/1000 Mbit optic fibre connection, and transfers go directly to and from an Azure blob storage container (the LFS API server only generates download and upload URLs, it intentionally doesn't transfer any data), with parallel connections and chunking etc, so it doesn't really get any better than that. And all of that is out of the box functionality too. :)
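
For anyone curious what "only generates URLs" looks like, here is a rough sketch of the core of a Git LFS batch endpoint (`POST /objects/batch`), written as a plain function so it doesn't depend on any web framework. This is not our actual code: `STORAGE_BASE` and `sign_url` are made-up placeholders, and a real server would return signed, expiring URLs (for example Azure Blob SAS URLs), handle auth, and report missing objects per the batch API spec.

```python
import json

# Hypothetical base URL, for illustration only.
STORAGE_BASE = "https://example.invalid/lfs-objects"


def sign_url(oid: str, operation: str) -> str:
    """Placeholder for URL signing; this sketch just builds a plain URL."""
    return f"{STORAGE_BASE}/{oid}?op={operation}"


def handle_batch(body: bytes, expires_in: int = 3600) -> dict:
    """Turn a batch request body into a batch response.

    The server never touches file contents: it only tells the client
    where to download from or upload to.
    """
    request = json.loads(body)
    operation = request["operation"]
    if operation not in ("download", "upload"):
        raise ValueError(f"unsupported operation: {operation!r}")

    objects = []
    for obj in request["objects"]:
        objects.append(
            {
                "oid": obj["oid"],
                "size": obj["size"],
                "authenticated": True,
                "actions": {
                    operation: {
                        "href": sign_url(obj["oid"], operation),
                        "expires_in": expires_in,
                    }
                },
            }
        )
    return {"transfer": "basic", "objects": objects}
```

In a Starlette app this would sit behind a single route and be returned as JSON with the `application/vnd.git-lfs+json` content type; the client then transfers the bytes straight to or from storage.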

Sorry, I should have been more specific: I meant block deduplication, or any form of deduplication at a level lower than the entire file. File deduplication can only get you so far, depending on the use case. XetHub does block deduplication, whereas I am implementing data-level deduplication. That is slower at recreating dataset snapshots (though this can be parallelized and delegated), but it saves disk space when changes are small but frequent. It can also be tied to collaborative features to show diffs, comment on them, and revert or edit changes where needed, all while pointing clearly to specific commits. And also potentially fork data or cumulative changes.
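
A toy Python sketch of the difference, assuming fixed-size blocks for simplicity. This is only an illustration of sub-file deduplication in general, not how XetHub or my own implementation works; `BLOCK_SIZE`, `store_file`, and `restore_file` are made-up names.

```python
import hashlib

BLOCK_SIZE = 4  # unrealistically small, to keep the example readable


def store_file(data: bytes, blocks: dict[str, bytes]) -> list[str]:
    """Store data as fixed-size blocks; return the ordered list of block hashes."""
    recipe = []
    for start in range(0, len(data), BLOCK_SIZE):
        chunk = data[start:start + BLOCK_SIZE]
        digest = hashlib.sha256(chunk).hexdigest()
        blocks.setdefault(digest, chunk)  # an already-known block costs nothing
        recipe.append(digest)
    return recipe


def restore_file(recipe: list[str], blocks: dict[str, bytes]) -> bytes:
    """Recreate a file version by concatenating its blocks in order."""
    return b"".join(blocks[digest] for digest in recipe)


blocks: dict[str, bytes] = {}
v1 = b"AAAABBBBCCCCDDDD"
v2 = b"AAAABBBBCCCCDDDE"  # same file with the last byte changed

recipe_v1 = store_file(v1, blocks)
recipe_v2 = store_file(v2, blocks)
stored_bytes = sum(len(chunk) for chunk in blocks.values())
```

Here the two versions total 32 bytes, but only 5 unique blocks (20 bytes) are stored, whereas whole-file deduplication would keep both versions in full. Fixed-size blocks stop lining up as soon as bytes are inserted or removed mid-file, which is why practical systems tend to use content-defined chunking instead.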

Yes, I meant either checking out other branches locally or, in the general case, pointing to another branch to tell any services to make that branch's data available wherever it's consumed. I am assuming that each incoming new file is then added to data pipelines, possibly just a few. Sounds like you are in the sweet spot where you have the speed you want and, given infrequent changes, you are fine with the versions taking up terabytes on Azure, since they are mostly new data.

Just read the second paragraph. Currently expanding merge resolution assistance to deal with the general merge conflict case, as well as implementing revert and cherry-pick assistance. Unsure if that is what you were wondering? You probably don't want to do that with 100 GB if most of your commits are new data rather than changes. Still, I wonder why the incoming files are kept separate. Are they all queued into the same pipelines, and kept separate so you don't have to deal with one giant cumulative file whose older parts would not be deduped in Git LFS (a great reason)? Or are they different file types anyway, in terms of contents and intended use (another great reason)? Or something else altogether? Are you processing data by including anything in a given folder in a given commit?
*I have just (only now) read the second paragraph in your message. Not sure if that came across correctly; that first sentence was too compressed.