by mcncfie
2688 days ago
Hmm. If I understand correctly, in order to reproduce the steps taken in creating machine learning models, I need to version control more things than just the code:

1. Code
2. Configuration (libraries etc)
3. Input/training data

1 and 2 are easily solved with Git and Docker respectively, although you would need some tooling to keep track of the various versions in a given run. 3 doesn't quite figure. According to the site DVC uses object storage to store input data but that leads to a few questions:

1. Why wouldn't I just use Docker and Git + Git LFS to do all of this? Is DVC just a wrapper for these tools?
2. Why wouldn't I just version control the query that created the data along with the code that creates the model?
3. What if I'm working on a large file and make a one byte change? I've never come across an object store that can send a diff, so surely you'd need to retransmit the whole file?
Answers:
1. DVC does dependency tracking in addition to that. It is like a lightweight ML pipelines tool or an ML-specific Makefile. Also, DVC is simply faster than LFS, which is critical in 10 GB+ cases.
2. This is a great case. However, in some scenarios, you would prefer to store the query output along with the query and DVC helps with that.
3. Correct, there are no data diffs. DVC just stores blobs and you can GC the old ones - https://dvc.org/doc/commands-reference/gc
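
To make answer 3 concrete, here is a toy sketch of a content-addressed blob store. It is not DVC's actual implementation: the `BlobStore` class, its `put` and `gc` methods, and the choice of SHA-256 are all assumptions made for illustration. It only shows why a one byte change produces a whole new blob, and what garbage collecting the old one looks like.

```python
import hashlib


class BlobStore:
    """Hypothetical in-memory blob store keyed by content hash."""

    def __init__(self):
        self.blobs = {}

    def put(self, data: bytes) -> str:
        # The key depends on the full content, so any change yields a new key
        # and the whole file is stored again (no diffs).
        key = hashlib.sha256(data).hexdigest()
        self.blobs[key] = data
        return key

    def gc(self, live_keys):
        # Drop every blob that is not referenced by a version we still care about.
        live = set(live_keys)
        removed = [k for k in self.blobs if k not in live]
        for k in removed:
            del self.blobs[k]
        return removed


def demo():
    store = BlobStore()
    original = bytes(1_000_000)        # 1 MB of zero bytes
    edited = b"\x01" + original[1:]    # same file with one byte changed
    key_v1 = store.put(original)
    key_v2 = store.put(edited)
    stored_bytes = sum(len(b) for b in store.blobs.values())
    removed = store.gc([key_v2])       # keep only the latest version
    return key_v1, key_v2, stored_bytes, removed, store


if __name__ == "__main__":
    key_v1, key_v2, stored_bytes, removed, store = demo()
    print("keys differ:", key_v1 != key_v2)
    print("bytes stored before gc:", stored_bytes)
    print("blobs left after gc:", len(store.blobs))
```

In this sketch the two versions share nothing in storage, so the cost of a small edit is a full copy until the old blob is collected.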