Hacker News
by adhocmobility 1351 days ago
If you just want a git for large data files, and your files don't get updated too often (e.g. an ML model deployed in production which gets updated every month), then git-lfs is a nice solution. Bitbucket and GitHub both have support for it.
5 comments

I've used both extensively. Git-lfs has always been a nightmare. Because each tracked large file can be in one of two states - binary, or "pointer" - it's super easy for the folder to get all fouled up. It would be unable to "clean" or "smudge", since either would cause some conflict. If you accidentally pushed in the wrong state, you could "infect" the remote and be really hosed. I had this happen numerous times over about 2 years of using lfs, and each time the only solution was some aggressive rewriting of history.
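
To make the two states concrete, here is a minimal sketch of how one might check which state a tracked file is in. It relies on the fact that a git-lfs pointer is a small text file whose first line is `version https://git-lfs.github.com/spec/v1`. The function name `is_lfs_pointer` and the 1024-byte cutoff are my own assumptions for illustration, not part of any git-lfs tooling.

```python
from pathlib import Path

# First line of a git-lfs pointer file.
LFS_SPEC_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path, max_pointer_bytes=1024):
    """Return True if the file looks like a git-lfs pointer, False if it
    looks like real (binary) content.

    max_pointer_bytes is an assumed cutoff: pointer files are tiny, so
    anything at or above it is treated as real content without reading it.
    """
    p = Path(path)
    if p.stat().st_size >= max_pointer_bytes:
        return False
    return p.read_bytes().startswith(LFS_SPEC_PREFIX)
```

Running something like this over the tracked paths before a push would at least show whether a checkout is in a mixed pointer/binary state. It only inspects the working tree, not what is actually staged in git.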

That, combined with the nature of re-using the same filename for the metadata files, meant that it was common for folks to commit the binary and push it. Again, lots of history rewriting to get git sizes back down.

Maybe there exist solutions to my problems but I had spent hours wrestling with it trying to fix these bad states, and it caused me much distress.

Also configuring the backing store was generally more painful, especially if you needed >2GB.

DVC was easy to use from the first moment. The separate meta files meant that it can't get into mixed clean/smudge states. If you aren't in a cloud workflow already, the backing store was a bit tricky, but even without AWS I made it work.

We resolve this in two ways

1. All git-lfs files are kept in the same folder

2. No one can directly push commits to one of the main branches; they need to raise a PR. This means that commits go through review and it's easy to tell if they've accidentally committed a binary, and we can just delete their branch from the remote, bringing the size back down.

I think the one thing that DVC does a bit better than git-lfs is that DVC doesn't keep the files directly in the repo. DVC puts a pointer file with a path and a hash of the file (to detect change). As far as I can tell, git-lfs only keeps them in the .git path of the repo.
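
A rough sketch of the "pointer file with a path and a hash" idea, to show how the hash detects change. This is not DVC's actual file format or code: the JSON layout, the choice of SHA-256, and the names `write_pointer` and `has_changed` are all assumptions made up for illustration.

```python
import hashlib
import json
from pathlib import Path


def file_hash(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files are never fully loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_pointer(data_path, pointer_path):
    """Record the data file's path and current hash in a small metadata file.

    Only this small file would be committed to git; the data itself
    would live in some separate backing store.
    """
    pointer = {"path": str(data_path), "hash": file_hash(data_path)}
    Path(pointer_path).write_text(json.dumps(pointer, indent=2))
    return pointer


def has_changed(pointer_path):
    """Return True if the data file no longer matches the recorded hash."""
    pointer = json.loads(Path(pointer_path).read_text())
    return file_hash(pointer["path"]) != pointer["hash"]
```

Because the metadata lives in its own file under a different name, the data file itself is never swapped between a "pointer" and a "binary" form.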

For example, I think CodeOcean might use git-lfs under the hood but handles upload and download separately from the UI. In the sample below, you can clone the repo from the Capsule menu, but data and results are each downloadable from their own contextual menu.

https://codeocean.com/capsule/2131051/tree/v1

I do feel like git-lfs is a good solution, but once you have 10s or 100s of GB of files (e.g. a computer vision project), it gets pretty pricey.

Ideally I'd love to use git-lfs on top of S3, directly. I've looked into git-annex and various git-lfs proxies, but I'm not sure they're maintained well enough to trust them with long-term data storage.

Hugging Face datasets are built on git-lfs and it works really well for them for storage of large datasets. Ideally I'd love for AWS to offer this as a hosted thin layer on top of S3, or for some well-funded or well-supported community effort to do the same, and in a performant way.

If you know of any such solution, please let me know!

Have you tested Weights & Biases Artifacts[1]?

It comes with a smart versioning approach, checks the Δ based on the checksum and has a feature to visualize the lineage.

You can also use your existing object store and link it for very large / sensitive data.[2]

Disclaimer: I work at W&B.

[1]: https://docs.wandb.ai/guides/data-and-model-versioning/model...
[2]: https://docs.wandb.ai/guides/artifacts/track-external-files#...

+1. git-lfs is sufficient for tracking binaries, including an ML model, at that cadence.

Thinking more abstractly, there is benefit for code and data to live "next" to each other, if possible: both are atomically committed to a codebase, and the latter is loaded / used by the former without connecting to yet another workflow.

It seems to be the solution Hugging Face have picked too.