
by kvnhn 1572 days ago

This is by no means a perfect match for your requirements, but I'll share a CLI tool I built, called Dud[0]. At the least it may spur some ideas.

Dud is meant to be a companion to SCM (e.g. Git) for large files. I was turned off Git LFS after a couple of failed attempts at using it for data science work. DVC[1] is an improvement in many ways, but it has some rough edges and serious performance issues[2].

With Dud I focused on speed and simplicity. To your three points above:

1) Dud can comfortably track datasets in the 100s of GBs. In practice, the bottleneck is your disk I/O speed.

2) Dud checks out binaries as links by default, so it's super fast to switch between commits.

3) Dud includes a means to build data pipelines -- think Makefiles with fewer footguns. Dud can detect when outputs are up to date and skip executing a pipeline stage.
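To make point 3 concrete, here is a minimal Python sketch of one way a tool could decide that a pipeline stage is up to date. It is an illustration only, not Dud's actual implementation: the function names (`file_checksum`, `record_stage`, `stage_is_up_to_date`) and the JSON lock-file layout are hypothetical. The idea is to record checksums of a stage's inputs and outputs after a successful run, then skip the stage later if nothing has changed.

```python
import hashlib
import json
from pathlib import Path


def file_checksum(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_stage(inputs, outputs, lock_file: Path) -> None:
    """After a stage runs, save the checksums of its inputs and outputs.

    The lock-file format (a flat JSON object of path -> checksum) is a
    hypothetical choice made for this sketch.
    """
    checksums = {str(Path(p)): file_checksum(Path(p)) for p in [*inputs, *outputs]}
    lock_file.write_text(json.dumps(checksums, indent=2))


def stage_is_up_to_date(inputs, outputs, lock_file: Path) -> bool:
    """Return True if every input and output still matches its recorded checksum."""
    if not lock_file.exists():
        return False  # the stage has never been recorded, so it must run
    recorded = json.loads(lock_file.read_text())
    for p in [*inputs, *outputs]:
        path = Path(p)
        if not path.exists():
            return False  # a missing file always forces a rerun
        if recorded.get(str(path)) != file_checksum(path):
            return False  # contents changed since the last recorded run
    return True
```

A runner built on this would call `stage_is_up_to_date` before executing a stage's command and `record_stage` after the command succeeds. Hashing every file is where disk I/O becomes the bottleneck for large datasets.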

I hope this helps, and I'd be happy to chat about it.

[0]: https://github.com/kevin-hanselman/dud

[1]: https://dvc.org

[2]: https://github.com/kevin-hanselman/dud#concrete-differences-...

1 comments

avar 1571 days ago

I'd be curious whether you've tried git-annex. I use it instead of git-lfs when I need to manage big binary blobs. It does the same trick, with a "check out" being a mere symlink.
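The symlink trick can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the actual storage layout of git-annex or Dud: the `commit_file` and `checkout_file` helpers and the flat, checksum-named cache directory are hypothetical. File contents live once in a content-addressed cache, and a "checkout" only repoints a symlink, so no data is copied.

```python
import hashlib
import os
import shutil
from pathlib import Path


def checkout_file(checksum: str, dest: Path, cache_dir: Path) -> None:
    """Point dest at the cached copy of the given version via a symlink."""
    if dest.is_symlink() or dest.exists():
        dest.unlink()  # drop the old link (or file) before relinking
    os.symlink((cache_dir / checksum).resolve(), dest)


def commit_file(path: Path, cache_dir: Path) -> str:
    """Move a file into the cache under its checksum and leave a symlink behind.

    Reads the whole file into memory for brevity; a real tool would
    hash large files in chunks.
    """
    checksum = hashlib.sha256(path.read_bytes()).hexdigest()
    cache_dir.mkdir(parents=True, exist_ok=True)
    cached = cache_dir / checksum
    if cached.exists():
        path.unlink()  # identical content is already stored once
    else:
        shutil.move(str(path), str(cached))
    checkout_file(checksum, path, cache_dir)
    return checksum
```

Switching between two committed versions of a multi-gigabyte file then costs one `unlink` and one `symlink` call, regardless of file size. Creating symlinks may require extra privileges on Windows.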


kvnhn 1571 days ago

I haven't used it, no. Around the time Git LFS was released, my read from the community was that Git LFS was favored to supersede git-annex, so I focused my time investigating Git LFS. Given that git-annex is still alive and well, I may have discounted it too quickly :) Maybe I'll revisit it in the future. Thanks for sharing!


avar 1570 days ago

Neither is favored. git-annex solves problems that Git LFS doesn't even try to address (distributed big files), at the cost of extra complexity.

Git LFS is intended more for a centralized "big repo" workflow, while git-annex's canonical usage is as a personal distributed backup system. Both can stretch into other domains, though.

In this case, git-annex seems to have a feature that Git LFS lacks and that would be useful to you.
