Hacker News
by lizen_one 1356 days ago
DVC had the following problems when I tested it (half a year ago):

It gets super slow (waiting minutes) when a few thousand files are tracked. Thousands of files have to be tracked if you have, e.g., a 10GB file per day and region, plus the artifacts generated from it.

You are encouraged (it only can track artifacts) if you model your pipeline in DVC (think like make). However, it cannot run tasks in parallel, so running a pipeline takes a lot of time when you are on a beefy machine and only one core is used. Obviously, you cannot run other tools (e.g. Snakemake) to distribute/parallelize across multiple machines. Running one (part of a) stage also has some overhead, because it does commits/checks before/after running the task's executable.
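To make the missing feature concrete, here is a minimal sketch of parallel stage execution over a make-like dependency graph, using only the Python standard library. It is not DVC's API or implementation; `run_pipeline`, `stages`, and `deps` are hypothetical names chosen for illustration.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter  # Python 3.9+


def run_pipeline(stages, deps, max_workers=4):
    """Run every stage, starting each one as soon as its dependencies finish.

    stages: dict mapping stage name -> zero-argument callable (hypothetical)
    deps:   dict mapping stage name -> set of stage names it depends on
    Returns the stage names in the order they completed.
    """
    sorter = TopologicalSorter({name: deps.get(name, set()) for name in stages})
    sorter.prepare()  # raises CycleError if the graph is not a DAG
    finished = []
    running = {}  # future -> stage name
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            # Submit every stage whose dependencies are already done.
            for name in sorter.get_ready():
                running[pool.submit(stages[name])] = name
            # Block until at least one running stage completes.
            done, _ = wait(running, return_when=FIRST_COMPLETED)
            for future in done:
                future.result()  # re-raise any exception from the stage
                name = running.pop(future)
                sorter.done(name)
                finished.append(name)
    return finished
```

With `deps = {"train": {"prepare_eu", "prepare_us"}}`, the two prepare stages would run concurrently and `train` would start only after both finish. Threads are enough when each stage mostly waits on an external executable; CPU-bound Python stages would need processes instead.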

Sometimes you get merge conflicts if you manually run one part of a (partially parameterized) stage on one machine and the other part on another machine. These are cumbersome to fix.

Currently, I think they are more focused on ML features like experiment tracking (I prefer other mature tools here) instead of performance and data safety.

There is an alternative implementation from a single developer (I cannot find it right now) that fixes some problems. However, I do not use it because it probably will not have the same development progress and testing as DVC.

This sounds negative, but I think it is currently one of the best tools in this space.

6 comments

You might be referring to me/Dud[0]. If you are, first off, thanks! I'd love to know more about what development progress you are hoping for. Is there a specific set of features that bar you from using Dud? As far as testing goes, Dud has a large and growing set of unit and integration tests[1] that are run in GitHub CI. I'll never have the same resources as Iterative/DVC, but my hope is that being open source will attract collaborators. PRs are always welcome ;)

[0]: https://github.com/kevin-hanselman/dud

[1]: https://github.com/kevin-hanselman/dud/tree/main/integration...

> You are encouraged if you model your pipeline in DVC.

Encouraged to do what?

You might want to slow down on the use of parentheses, we are both getting lost in them.

I assume they meant to say "you are encouraged to use DVC to run your model and experiment pipeline". They want to encourage you to do this because they are trying to build a business around being a data science ops ecosystem. But the truth is that DVC is not a great tool for running "experiments" searching over a parameter space. It could be improved in that regard, but that's just not what I use it for, nor what I recommend it to other people for.

However, it's fantastic for tracking artifacts throughout a project that have been generated by other means, for keeping those artifacts tightly in sync with Git, and for making it easy to share those artifacts without forcing people to re-run expensive pipelines.
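As a rough illustration of the idea (not DVC's actual on-disk format or API), artifact tracking of this kind can be sketched as a content-addressed cache plus a small pointer file that lives in Git. The function names, the `.ptr.json` suffix, and the choice of SHA-256 are all assumptions made for this example.

```python
import hashlib
import json
import shutil
from pathlib import Path


def track_artifact(artifact, cache_dir):
    """Copy an artifact into a content-addressed cache and write a pointer file.

    Conceptual sketch only; the pointer format here is made up.
    """
    artifact = Path(artifact)
    digest = hashlib.sha256()
    with artifact.open("rb") as f:
        # Hash in 1 MiB chunks so large files never sit in memory at once.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    key = digest.hexdigest()

    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    cached = cache_dir / key
    if not cached.exists():  # identical content is stored only once
        shutil.copy2(artifact, cached)

    # The small pointer file is what would be committed to Git.
    pointer = artifact.with_name(artifact.name + ".ptr.json")
    pointer.write_text(json.dumps({"path": artifact.name, "sha256": key}, indent=2))
    return pointer


def restore_artifact(pointer, cache_dir):
    """Recreate the artifact next to its pointer file from the cache."""
    pointer = Path(pointer)
    meta = json.loads(pointer.read_text())
    target = pointer.with_name(meta["path"])
    shutil.copy2(Path(cache_dir) / meta["sha256"], target)
    return target
```

A collaborator who checks out the pointer file and has access to the shared cache can call `restore_artifact` instead of re-running the pipeline that produced the file.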

> But the truth is that DVC is not a great tool for running "experiments" searching over a parameter space.

Would love your feedback on what's missing there! We've been improving it lately, e.g.

- Hydra support https://dvc.org/doc/user-guide/experiment-management/hydra

- VS Code extension - https://marketplace.visualstudio.com/items?itemName=Iterativ...

Last I checked, it wasn't easy to use something like Optuna to do hyperparameter tuning with Hydra/DVC.

Ideally, I'd like the tool I use for data versioning (DVC/git-lfs/git-annex) to be orthogonal to the one I use for hyperparameter sweeping (DVC/Optuna/SageMaker experiments), to the one I use for configuration management (DVC/Hydra/plain YAML), and to the one I use for experimental DAG management (DVC/Makefile).

Optuna is becoming very popular in the data-science/deep learning ecosystem at the moment. It would be great to see more composable tools, rather than having to opt all-in into a given ecosystem.

Love the work that DVC is doing to tackle these difficult problems, though!

Big +1 about composability and orthogonality. I don't want one "do it all" tool, I want a collection of small tools that interoperate nicely. Like how you can use Airflow and dbt together, but neither tool really tries to do what the other one does (not that Airflow is "small", but still).

DVC is great for use cases that don't get to this scale or have these needs. And the issues here are non-trivial to solve. I've spent a lot of time figuring out how to solve them in Pachyderm, which is good for use cases where you do need higher levels of scale or might run into merge conflicts with DVC. There are trade-offs, though. DVC is definitely easier for a single developer / data scientist to get up and running with.

I think it's worth noting that DVC can be used to track artifacts that have been generated by other tools. For example, you could use MLflow to run several model experiments, but at the end track the artifacts with DVC. Personally, I think that this is the best way to use it.

However, I agree that in general it's best for smaller projects and use cases. For example, it still shares the primary deficiency of Make in that it can only track files on the file system, and not things like ensuring a database table has been created (unless you 'touch' your own sentinel files).
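The sentinel-file workaround can be sketched like this. SQLite stands in for the database, and `ensure_table`, the `events` table, and the paths are hypothetical; the point is only that the empty file gives a file-based tool something to track.

```python
import sqlite3
from pathlib import Path


def ensure_table(db_path, sentinel_path):
    """Create a table, then touch a sentinel file that a file-based tool can track."""
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "CREATE TABLE IF NOT EXISTS events "
                "(id INTEGER PRIMARY KEY, payload TEXT)"
            )
    finally:
        conn.close()
    # Only reached if the table creation succeeded.
    sentinel = Path(sentinel_path)
    sentinel.parent.mkdir(parents=True, exist_ok=True)
    sentinel.touch()
    return sentinel
```

The stage would then declare the sentinel file as its output. The weakness remains that the sentinel only records that the step once ran; if someone drops the table by hand, the file still says everything is up to date.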

The alternative tool you are referring to is `Dud`, I believe.

DVC is the best tool (I found) in spite of being dead slow and complex (trying to do many things).

What alternatives would you recommend?

What’s best if parallel step processing is required?

Yeah, we had a lot of problems with things getting out of sync, and we just got tired of it.