by mcncfie 2688 days ago

Hmm. If I understand correctly, in order to reproduce the steps taken in creating machine learning models, I need to version control more things than just the code:

1. Code

2. Configuration (libraries etc)

3. Input/training data

1 and 2 are easily solved with Git and Docker respectively, although you would need some tooling to keep track of which versions went into a given run. 3 is less obvious.

According to the site DVC uses object storage to store input data but that leads to a few questions:

1. Why wouldn't I just use Docker and Git + Git LFS to do all of this? Is DVC just a wrapper for these tools?

2. Why wouldn't I just version control the query that created the data along with the code that creates the model?

3. What if I'm working on a large file and make a one byte change? I've never come across an object store that can send a diff, so surely you'd need to retransmit the whole file?


dmpetrov 2688 days ago

@mcncfie your understanding is correct. #3 might also include output data/models and intermediate results such as preprocessed data. DVC also handles dependencies between all of these.
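To illustrate what dependency handling between code, data, and models can look like, here is a minimal sketch that marks a stage as stale when the hash of any of its inputs changes. It is not DVC's implementation: the helper names (`file_hash`, `record`, `is_stale`), the file names, and the choice of SHA-256 are all assumptions made for the example.

```python
import hashlib
import tempfile
from pathlib import Path


def file_hash(path: Path) -> str:
    """Hash a file's contents (SHA-256 is an arbitrary choice here)."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def record(deps: list[Path]) -> dict[str, str]:
    """Remember the hash of every dependency at the time a stage ran."""
    return {str(p): file_hash(p) for p in deps}


def is_stale(deps: list[Path], recorded: dict[str, str]) -> bool:
    """A stage must be re-run if any dependency differs from its recorded hash."""
    return any(recorded.get(str(p)) != file_hash(p) for p in deps)


# Hypothetical stage with two dependencies: the training code and its input data.
workdir = Path(tempfile.mkdtemp())
code = workdir / "train.py"
data = workdir / "data.csv"
code.write_text("print('train')\n")
data.write_text("x,y\n1,2\n")

deps = [code, data]
lock = record(deps)                      # taken right after the stage ran
stale_before = is_stale(deps, lock)      # nothing changed yet

data.write_text("x,y\n1,2\n3,4\n")       # the input data evolves
stale_after = is_stale(deps, lock)       # the stage now needs to be re-run
```

The point of the sketch is that the decision to re-run depends on content hashes rather than on timestamps, so data files are treated the same way as code.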

Answers:

1. DVC does dependency tracking in addition to that. It is like a lightweight ML pipeline tool or an ML-specific Makefile. Also, DVC is simply faster than LFS, which is critical in 10 GB+ cases.

2. This is a great case. However, in some scenarios, you would prefer to store the query output along with the query and DVC helps with that.

3. Correct, there are no data diffs. DVC just stores blobs and you can GC the old ones - https://dvc.org/doc/commands-reference/gc
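For answer 3, a toy content-addressed store shows why a one-byte change costs a whole new blob and what garbage collection then reclaims. This is only a sketch under stated assumptions: `BlobStore` and its methods are hypothetical, SHA-256 is an arbitrary choice, and none of it reflects DVC's actual cache layout or the behavior of `dvc gc`.

```python
import hashlib
import tempfile
from pathlib import Path


class BlobStore:
    """Toy content-addressed store: every file version is kept as a whole blob."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def put(self, data: bytes) -> str:
        """Store the data under the hash of its contents and return that hash."""
        digest = hashlib.sha256(data).hexdigest()
        (self.root / digest).write_bytes(data)
        return digest

    def get(self, digest: str) -> bytes:
        return (self.root / digest).read_bytes()

    def gc(self, live: set[str]) -> int:
        """Delete every blob that is not in the set of still-referenced hashes."""
        removed = 0
        for blob in list(self.root.iterdir()):
            if blob.name not in live:
                blob.unlink()
                removed += 1
        return removed


store = BlobStore(Path(tempfile.mkdtemp()) / "cache")
v1 = store.put(b"a" * 1_000_000)
v2 = store.put(b"a" * 999_999 + b"b")    # same file with one byte changed

# Both versions are stored in full: no diff, no sharing between them.
stored_bytes = sum(p.stat().st_size for p in store.root.iterdir())

# Once only v2 is referenced, collecting garbage drops the old blob.
removed = store.gc(live={v2})
```

Because the key is a hash of the whole file, any change produces an unrelated key, which is why the full file has to be stored and transmitted again.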

cyphar 2688 days ago

> Correct, there are no data diffs. DVC just stores blobs and you can GC the old ones

Have you looked into using content-defined chunking (à la restic or borgbackup) so that you get deduplication without the need to send around diffs? This is related to a problem that I'm working on solving in OCI (container) images[1].

[1]: https://www.cyphar.com/blog/post/20190121-ociv2-images-i-tar
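As a rough illustration of content-defined chunking, the sketch below cuts a byte stream wherever a rolling hash over the most recent bytes hits a chosen pattern, so chunk boundaries follow the content rather than fixed offsets. It uses a simple gear-style rolling hash picked for brevity; it is not the chunker used by restic or borgbackup, and the table, parameters, and function names are assumptions made for the example.

```python
import hashlib
import random

# Fixed pseudo-random table for the gear-style rolling hash (illustrative only).
_table_rng = random.Random(0)
GEAR = [_table_rng.getrandbits(64) for _ in range(256)]
MASK64 = (1 << 64) - 1


def chunk(data: bytes, avg_bits: int = 12,
          min_size: int = 256, max_size: int = 65536) -> list[bytes]:
    """Split data where the top avg_bits bits of the rolling hash are all zero."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        # Shifting left means bytes older than 64 positions drop out of the hash.
        h = ((h << 1) + GEAR[byte]) & MASK64
        size = i - start + 1
        at_boundary = size >= min_size and (h >> (64 - avg_bits)) == 0
        if at_boundary or size >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def digests(chunks: list[bytes]) -> set[str]:
    return {hashlib.sha256(c).hexdigest() for c in chunks}


data_rng = random.Random(1)
original = data_rng.randbytes(200_000)
edited = original[:100] + b"X" + original[100:]   # insert one byte near the start

old, new = digests(chunk(original)), digests(chunk(edited))
to_upload = new - old          # only chunks the store has not seen before
```

With fixed-size blocks, the inserted byte would shift every later block and invalidate all of them. Here the boundaries realign shortly after the edit, so only the chunks in `to_upload` would need to be sent.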

dmpetrov 2687 days ago

Content-defined chunks - very interesting. I'd suggest you ask this question in the DVC issue tracker or the DVC channel: https://dvc.org/chat

mcncfie 2688 days ago

Thanks! Regarding 2, could you give an example?

Also, can I combine DVC with a pipeline tool like Apache Airflow?

dmpetrov 2688 days ago

Example: a query to a DB gives you different results over time because the data/table evolves. So you just store the query output (let's say a couple of GB) in DVC to make your research reproducible.

This is like assigning a random seed to a DB :)
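A small sketch of the idea, using an in-memory SQLite database: the same query returns different rows once the table grows, so the first result is frozen as bytes and identified by its hash. The table, the query, and the `snapshot` helper are made up for the example, and the step of putting the frozen file under DVC is left out.

```python
import csv
import hashlib
import io
import sqlite3


def snapshot(conn: sqlite3.Connection, query: str) -> tuple[bytes, str]:
    """Run the query and return its output as CSV bytes plus a content hash."""
    rows = conn.execute(query).fetchall()
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    data = buf.getvalue().encode("utf-8")
    return data, hashlib.sha256(data).hexdigest()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, label TEXT)")
conn.executemany("INSERT INTO events (label) VALUES (?)", [("cat",), ("dog",)])

QUERY = "SELECT id, label FROM events ORDER BY id"

# Freeze the query output that the experiment was actually trained on.
frozen, frozen_hash = snapshot(conn, QUERY)

# The table evolves, so re-running the same query no longer reproduces the input.
conn.execute("INSERT INTO events (label) VALUES ('bird')")
later, later_hash = snapshot(conn, QUERY)
```

Versioning only the query would silently pick up the new row; versioning `frozen` (or its hash alongside the stored file) pins the exact training input.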

Sure, some teams combine DVC with Airflow. It gives a clear separation between engineering (reliability) and data science (lightweight, quick iteration). A recent discussion about this: https://twitter.com/FullStackML/status/1091840829683990528
