by eliomattia 1139 days ago
That is really interesting and raises the question of how frequently changes in your data lead to new commits. I am assuming here that you don't dedupe anything, that is, you upload entire files to Azure with each version, since it's cheap enough for your purposes. Also, how frequently do you move HEAD, even without committing anything new, perhaps to use another branch?

LFS stores files by content hash, so deduplication happens that way. But you're right that if you frequently make small changes to a single large file, it's wasteful.
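
To illustrate the point about content hashes, here is a minimal sketch of a content-addressed store. It is not git-lfs code: `ContentStore` and its methods are hypothetical names made up for this example, and the only thing borrowed from LFS is the idea of keying each object by the SHA-256 of its content. Identical files collapse to one blob, while a one-byte edit to a large file stores a complete second copy.

```python
import hashlib


class ContentStore:
    """Toy content-addressed store (hypothetical, for illustration only)."""

    def __init__(self):
        self.blobs = {}  # hex SHA-256 digest -> file content

    def put(self, data: bytes) -> str:
        """Store data under its SHA-256 digest and return that digest."""
        oid = hashlib.sha256(data).hexdigest()
        # Identical content maps to the same key, so it is stored only once.
        self.blobs.setdefault(oid, data)
        return oid

    def stored_bytes(self) -> int:
        return sum(len(blob) for blob in self.blobs.values())


store = ContentStore()
big_file = bytes(1_000_000)         # 1 MB of zero bytes
oid_a = store.put(big_file)
oid_b = store.put(big_file)         # same content: no extra storage used
edited = b"\x01" + big_file[1:]     # change a single byte
oid_c = store.put(edited)           # different hash: a full second copy

print(oid_a == oid_b, oid_a == oid_c, store.stored_bytes())
```

After the three `put` calls the store holds two blobs of 1 MB each, which is the "wasteful" case for small edits to a big file.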

In our case, though, we don't frequently change files; we just get lots and lots of new big files coming in all the time.

Moving HEAD, as in checking out another branch locally? Somewhat regularly, I guess. I suppose you're wondering about performance in that scenario? It's usually quite good, since git-lfs does some local caching as well; I've never needed to wait longer than a couple of seconds. I'm usually on a wired 1000/1000 Mbit optical fibre connection, and transfers go directly to and from an Azure Blob Storage container (the LFS API server only generates download and upload URLs; it intentionally doesn't transfer any data), with parallel connections, chunking, etc., so it doesn't really get any better than that. And all of that is out-of-the-box functionality too. :)
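
As a rough sketch of the "server only hands out URLs" design: the function below builds a response loosely modeled on the Git LFS batch API, where the client sends a list of object IDs and sizes and gets back one link per object. This is not the server described above. `batch_response` and `fake_sign_url` are hypothetical names, the storage host is a placeholder, and a real server would generate genuinely signed, expiring storage URLs and handle cases this sketch ignores (errors, objects that already exist on upload).

```python
from typing import Callable


def batch_response(request: dict, sign_url: Callable[[str, str], str]) -> dict:
    """Answer a batch request with links only; no file data passes through here."""
    operation = request["operation"]  # "download" or "upload"
    objects = []
    for obj in request["objects"]:
        objects.append({
            "oid": obj["oid"],
            "size": obj["size"],
            # The client transfers directly to or from this URL.
            "actions": {
                operation: {
                    "href": sign_url(operation, obj["oid"]),
                    "expires_in": 3600,  # seconds; arbitrary value for the sketch
                }
            },
        })
    return {"transfer": "basic", "objects": objects}


def fake_sign_url(operation: str, oid: str) -> str:
    """Stand-in for a real signer; returns an unsigned placeholder URL."""
    return f"https://storage.example.invalid/lfs/{oid}?op={operation}&sig=PLACEHOLDER"


request = {
    "operation": "download",
    "objects": [{"oid": "ab" * 32, "size": 123456}],
}
response = batch_response(request, fake_sign_url)
print(response["objects"][0]["actions"]["download"]["href"])
```

Because the API server only returns links, its own bandwidth stays out of the data path, and the client is free to fetch many objects in parallel straight from storage.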

Sorry, I should have been more specific: I meant block deduplication, or any form of deduplication at a level lower than the entire file. File deduplication can only get you so far, depending on the use case. XetHub does block deduplication, whereas I am implementing data-level deduplication, which is slower at recreating dataset snapshots (though that can be parallelized and delegated), but saves disk space with small but frequent changes and can be tied to collaborative features to show diffs, comment on them, and revert or edit changes where needed, all while pointing clearly to specific commits. It could also potentially support forking data or cumulative changes.
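
To make the distinction concrete, here is a minimal sketch of deduplication below the file level. It uses fixed-size blocks, chosen only for simplicity; it is not a description of how XetHub or my own implementation works, and `BlockStore`, `BLOCK_SIZE`, and the manifest format are all hypothetical. Each file version becomes a list of block hashes, and only blocks not seen before consume new space.

```python
import hashlib

BLOCK_SIZE = 4096  # arbitrary block size for the sketch


class BlockStore:
    """Toy block-level deduplicating store (hypothetical, for illustration)."""

    def __init__(self, block_size: int = BLOCK_SIZE):
        self.block_size = block_size
        self.blocks = {}  # hex SHA-256 digest -> block content

    def put(self, data: bytes) -> list:
        """Split data into blocks, store unseen ones, return the list of digests."""
        manifest = []
        for start in range(0, len(data), self.block_size):
            block = data[start:start + self.block_size]
            digest = hashlib.sha256(block).hexdigest()
            self.blocks.setdefault(digest, block)
            manifest.append(digest)
        return manifest

    def get(self, manifest: list) -> bytes:
        """Rebuild a file version from its manifest."""
        return b"".join(self.blocks[digest] for digest in manifest)

    def stored_bytes(self) -> int:
        return sum(len(block) for block in self.blocks.values())


# 256 distinct blocks of 4096 bytes each (1 MiB in total).
v1 = b"".join(i.to_bytes(4, "big") * 1024 for i in range(256))
v2 = b"\xff" + v1[1:]  # same file with its first byte changed

store = BlockStore()
m1 = store.put(v1)
m2 = store.put(v2)
shared = sum(a == b for a, b in zip(m1, m2))
print(shared, store.stored_bytes())
```

Storing both versions costs 257 blocks instead of 512, because 255 of the 256 blocks are shared. Fixed-size blocks stop matching as soon as bytes are inserted or removed rather than overwritten, since every later block boundary shifts; content-defined chunking is the usual way around that.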

Yes, I meant either checking out other branches locally or, in the general case, pointing to another branch to tell any services to make data from that branch available wherever it's consumed. I am assuming that each incoming new file is then added to data pipelines, possibly just a few. Sounds like you are in the sweet spot where you have the speed you want and, given infrequent changes, you are fine with the versions taking up terabytes on Azure, since they are mostly new data.