Hacker News
by kernelsanderz 1349 days ago
I do feel like git-lfs is a good solution, but once you have tens or hundreds of GB of files (e.g. a computer vision project), it gets pretty pricey.

Ideally I'd love to use git-lfs on top of S3, directly. I've looked into git-annex and various git-lfs proxies, but I'm not sure they're maintained well enough to trust them with long-term data storage.

Hugging Face datasets are built on git-lfs, and it works really well for them for storing large datasets. Ideally I'd love for AWS to offer this as a hosted thin layer on top of S3, or for some well-funded or well-supported community effort to do the same in a performant way.

If you know of any such solution, please let me know!

1 comment

Have you tested Weights & Biases Artifacts[1]?

It comes with a smart versioning approach: it detects what changed (the Δ) between versions based on file checksums, and it has a feature to visualize lineage.
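
To make the checksum idea concrete, here is a rough sketch of how a checksum-based Δ between two versions of a dataset directory could be computed. This is not W&B's implementation; the function names (`file_checksum`, `build_manifest`, `diff_manifests`) and the choice of SHA-256 are my own assumptions, and it uses only the Python standard library.

```python
import hashlib
from pathlib import Path


def file_checksum(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its checksum."""
    return {
        p.relative_to(root).as_posix(): file_checksum(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def diff_manifests(old: dict[str, str], new: dict[str, str]) -> dict[str, list[str]]:
    """Compare two manifests and list added, removed, and changed paths."""
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }
```

With something like this, only the files listed under "added" and "changed" would need to be uploaded for a new version; unchanged files can be referenced by their existing checksum.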

You can also use your existing object store and link it for very large / sensitive data.[2]

Disclaimer: I work at W&B.

[1]: https://docs.wandb.ai/guides/data-and-model-versioning/model...

[2]: https://docs.wandb.ai/guides/artifacts/track-external-files#...