| This is by no means a perfect match for your requirements, but I'll share a CLI tool I built called Dud[0]. At the very least it may spur some ideas. Dud is meant to be a companion to an SCM (e.g. Git) for large files. I was turned off Git LFS after a couple of failed attempts at using it for data science work. DVC[1] is an improvement in many ways, but it has some rough edges and serious performance issues[2]. With Dud I focused on speed and simplicity. To your three points above: 1) Dud can comfortably track datasets in the hundreds of gigabytes. In practice, the bottleneck is your disk I/O speed. 2) Dud checks out binaries as links by default, so switching between commits is very fast. 3) Dud includes a means to build data pipelines: think Makefiles with fewer footguns. Dud can detect when a stage's outputs are up to date and skip executing that stage. I hope this helps, and I'd be happy to chat about it. [0]: https://github.com/kevin-hanselman/dud [1]: https://dvc.org [2]: https://github.com/kevin-hanselman/dud#concrete-differences-... |
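
To make points 2 and 3 a bit more concrete, here is a rough Python sketch of the general idea behind link-based checkouts and up-to-date checks. This is not Dud's actual code or on-disk layout. The function names (`commit`, `checkout`, `is_up_to_date`), the flat cache directory, and the choice of SHA-256 are all assumptions made up for illustration.

```python
import hashlib
import shutil
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def commit(path: Path, cache: Path) -> str:
    """Store a file in the cache under its digest and leave a symlink behind.

    Hypothetical helper: the cache layout here is an assumption.
    """
    digest = file_digest(path)
    cached = cache / digest
    cache.mkdir(parents=True, exist_ok=True)
    if cached.exists():
        # Identical content is already cached; drop the duplicate.
        path.unlink()
    else:
        shutil.move(str(path), str(cached))
    path.symlink_to(cached.resolve())
    return digest


def checkout(path: Path, digest: str, cache: Path) -> None:
    """Point the workspace path at a cached version; no file data is copied."""
    if path.is_symlink() or path.exists():
        path.unlink()
    path.symlink_to((cache / digest).resolve())


def is_up_to_date(recorded: dict) -> bool:
    """True if every output exists and still matches its recorded digest."""
    return all(p.exists() and file_digest(p) == d for p, d in recorded.items())
```

In this sketch, switching versions only rewrites a symlink, which is why checkout cost does not grow with file size. The up-to-date check, on the other hand, has to read every output to hash it, so it is bound by disk I/O.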