by kernelsanderz
1349 days ago
I do feel like git-lfs is a good solution, but once you have 10s or 100s of GB of files (e.g. a computer vision project), it gets pretty pricey. Ideally I'd love to use git-lfs directly on top of S3. I've looked into git-annex and various git-lfs proxies, but I'm not sure they're maintained well enough to trust with long-term data storage. Hugging Face datasets are built on git-lfs, and it works really well for them for storing large datasets. Ideally I'd love for AWS to offer this as a hosted thin layer on top of S3, or for some well-funded or well-supported community effort to do the same, and in a performant way. If you know of any such solution, please let me know!
|
It comes with a smart versioning approach: it detects changes (the Δ) by comparing checksums, and it has a feature to visualize lineage.
You can also link your existing object store for very large or sensitive data.[2]
Disclaimer: I work at W&B.
[1]: https://docs.wandb.ai/guides/data-and-model-versioning/model...
[2]: https://docs.wandb.ai/guides/artifacts/track-external-files#...
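
For anyone curious what checksum-based change detection looks like in general, here is a minimal Python sketch. It is not W&B's implementation and uses none of its API; the names `file_sha256`, `build_manifest`, and `diff_manifests` are made up for illustration, and SHA-256 is just an assumed choice of hash.

```python
import hashlib
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files never have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 hex digest."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): file_sha256(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def diff_manifests(old, new):
    """Compare two manifests and report which paths were added, removed, or changed."""
    old_keys, new_keys = set(old), set(new)
    return {
        "added": sorted(new_keys - old_keys),
        "removed": sorted(old_keys - new_keys),
        "changed": sorted(k for k in old_keys & new_keys if old[k] != new[k]),
    }
```

The idea is that a new version only needs to store the files listed under "added" and "changed"; everything else can be referenced from the previous version by its checksum.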