| HN Mirror


	by ajoseps 650 days ago
	If the data files are all just text files, what is the difference between using DVC and using plain git?

3 comments

miki123211 650 days ago

DVC does a lot more than git.

It essentially makes sure that your results can be reproducibly generated from your original data. If any script or data file changes, the parts of your pipeline that depend on it, directly or transitively, get re-run the next time you reproduce the pipeline (`dvc repro`), and the relevant results get updated.

There's no chance of, say, slightly changing the structure of your original dataset, forgetting to regenerate one of the intermediate models, not noticing that the script that regenerates it no longer works with the new dataset structure, and then being reminded a year later when you move to a new computer and try to regenerate everything from scratch.

It's a lot like Unix make, but it also keeps track of different git branches and the data and intermediates each one needs. That saves you from regenerating everything every time you make a new checkout, and lets you easily exchange large datasets with teammates.
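To make the "possibly recursively" part concrete, here is a rough Python sketch of how a make-like tool could work out which stages are stale. This is not DVC's code: the `STAGES` table, the file names, and the `stale_stages` helper are all made up for illustration, and the sketch only says which stages need re-running, not in what order.

```python
# Hypothetical pipeline: each stage lists the files it reads (deps)
# and the files it writes (outs). None of these names come from DVC.
STAGES = {
    "clean":    {"deps": ["raw.csv", "clean.py"],   "outs": ["clean.csv"]},
    "train":    {"deps": ["clean.csv", "train.py"], "outs": ["model.pkl"]},
    "evaluate": {"deps": ["model.pkl", "eval.py"],  "outs": ["scores.json"]},
}


def stale_stages(stages, changed_files):
    """Return the set of stages that must be re-run after changed_files changed."""
    dirty = set(changed_files)  # files whose contents can no longer be trusted
    stale = set()
    grew = True
    while grew:  # repeat until no new stage becomes stale
        grew = False
        for name, stage in stages.items():
            if name not in stale and dirty.intersection(stage["deps"]):
                stale.add(name)
                # Outputs of a stale stage are dirty too, which is what
                # propagates the change to downstream stages.
                dirty.update(stage["outs"])
                grew = True
    return stale
```

With this table, touching `train.py` marks `train` and `evaluate` as stale but leaves `clean` alone, while touching `raw.csv` invalidates all three.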

In theory, you could store everything in git. But then every small change to your scripts that, e.g., changed the way some model works and slightly adjusted a score for each of ten million rows would produce a 10-million-line diff, and every version of that dataset would be stored in your repo forever, making it unbelievably large.

amelius 649 days ago

Sounds like it is more a framework than a tool.

Not everybody wants a framework.

JadeNB 649 days ago

> Sounds like it is more a framework than a tool.

> Not everybody wants a framework.

The second part of this comment seems strange to me. Surely nothing on Hacker News is shared with the expectation that it will be interesting, or useful, to everyone. Equally, surely there are some people on HN who will be interested in a framework, even if it might be too heavy for other people.

amelius 649 days ago

Just saying that what makes Git so appealing is that it does one thing well, and from this view DVC seems to be in an entirely different category.

stochastastic 649 days ago

It doesn’t force you to use any of the extra functionality. My team has been using it just for the version control part for a couple of years and it has worked great.

bach4ants 643 days ago

Yep. I personally like DVC's pipeline implementation because it's lightweight and language-agnostic, but haven't gotten into using their experiment tracking features.

woodglyst 649 days ago

This sounds a lot like the experimental project Jacquard [0] from Ink & Switch.

[0] https://www.inkandswitch.com/jacquard/notebook/

azinman2 650 days ago

So where do the adjusted 10M rows live instead? S3?

thangngoc89 649 days ago

DVC supports multiple remotes. S3 is one of them; there are also WebDAV, local FS, Google Drive, and a bunch of others. You can see the full list here [0]. Disclaimer: not affiliated with DVC in any way, just a user.

[0] https://dvc.org/doc/user-guide/data-management/remote-storag...
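For a feel of how big data can live on a remote while git only carries something tiny, here is a minimal sketch of a content-addressed store backed by a plain directory (the "local FS" case). It illustrates the general idea only and is not DVC's implementation: the `push` and `pull` helpers, the two-character prefix layout, and the choice of MD5 are all assumptions made for the example.

```python
import hashlib
import shutil
from pathlib import Path


def push(data_file: Path, remote_dir: Path) -> str:
    """Copy data_file into remote_dir under its content hash and return the hash."""
    digest = hashlib.md5(data_file.read_bytes()).hexdigest()
    # Assumed layout: first two hex characters as a directory, the rest as the file name.
    target = remote_dir / digest[:2] / digest[2:]
    target.parent.mkdir(parents=True, exist_ok=True)
    if not target.exists():  # identical content is stored only once
        shutil.copyfile(data_file, target)
    return digest


def pull(digest: str, remote_dir: Path, dest: Path) -> None:
    """Restore the file with the given content hash from remote_dir to dest."""
    shutil.copyfile(remote_dir / digest[:2] / digest[2:], dest)
```

In a setup like this, the only thing that would need to be committed to git is the short hash returned by `push`, so a new version of a 10M-row file adds one line to the repo history rather than the whole file.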

dmpetrov 650 days ago

In this case, you need DVC if:

1. Files are too large for Git and Git LFS.

2. You prefer using S3/GCS/Azure as storage.

3. You need to track transformations/pipelines on the files: cleaning up a text file, training a model, etc.

Otherwise, vanilla Git may be sufficient.

agile-gift0262 649 days ago

It's not just for managing file versions. You can define a pipeline with different stages, along with the dependencies and outputs of each stage, and DVC will figure out which stages need running depending on which dependencies have changed. Stages can also output metrics and plots, and DVC has utilities to expose, explore and compare those.
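One way a tool can tell that a dependency "has changed" is to record a hash of each dependency when a stage last ran and compare against the current contents. A rough sketch under that assumption follows; the `snapshot` and `needs_run` helpers and the use of SHA-256 are invented for the example and are not taken from DVC.

```python
import hashlib
from pathlib import Path


def file_hash(path: Path) -> str:
    """Hash a file's contents, so an edit counts as a change but a mere touch does not."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def snapshot(deps):
    """Map each dependency path to the hash of its current contents."""
    return {str(p): file_hash(Path(p)) for p in deps}


def needs_run(deps, recorded):
    """A stage needs running if its dependencies differ from the recorded snapshot."""
    return snapshot(deps) != recorded
```

After a stage runs successfully you would store `snapshot(deps)` somewhere (a lock file, say) and pass it back in as `recorded` next time. Unlike make's timestamp comparison, this stays correct across fresh checkouts, where modification times are not meaningful.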
