It looks like in those benchmarks Oxen.AI makes the misguided assumption that benchmarking DVC is roughly the same as benchmarking DVC<>DagsHub (a server side made by a different company). To my understanding, DagsHub is the bottleneck there. They didn't bother to benchmark DVC against an S3 bucket or similar cloud storage, which is more widely used. I wonder if that's because DagsHub makes the whole setup way slower.
Oxen dev here - let me add some benchmarks for DVC backed by an S3 bucket. I did it a while back and we were still faster, but I agree it's a good benchmark to have.
Fundamentally, DVC is slower even at adding and committing data locally, before any push happens. But I agree the remote matters too.
Where does that push to? Does this benchmark really just measure how well-provisioned various VC-funded websites currently are?
I think a proper benchmark here would be to install the server parts of Oxen, Git-LFS, etc. on the same machine, and then time how long it takes to commit and push the same dataset from some other machine.
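For what it's worth, a minimal sketch of that kind of timing harness, assuming each tool's add / commit / push can be driven as a plain command line. The `time_steps` helper and the `demo_steps` list are hypothetical names for illustration, and the placeholder commands are there only so the sketch runs anywhere; they are not the real Oxen, DVC, or Git-LFS invocations.

```python
import subprocess
import sys
import time


def time_steps(steps, cwd=None):
    """Run each (label, argv) step once, in order; return {label: seconds}."""
    timings = {}
    for label, argv in steps:
        start = time.perf_counter()
        # check=True aborts the benchmark if a step fails instead of
        # silently timing an error path.
        subprocess.run(argv, cwd=cwd, check=True, capture_output=True)
        timings[label] = time.perf_counter() - start
    return timings


# Placeholder steps (assumption): swap in the real add / commit / push
# commands of the tool under test, all pointed at the same server machine.
demo_steps = [
    ("add", [sys.executable, "-c", "pass"]),
    ("commit", [sys.executable, "-c", "pass"]),
    ("push", [sys.executable, "-c", "import time; time.sleep(0.05)"]),
]

if __name__ == "__main__":
    for label, seconds in time_steps(demo_steps).items():
        print(f"{label}: {seconds:.3f}s")
```

Each step runs exactly once because re-running add or commit on an unchanged repo is usually a no-op, so repeated measurements would need a fresh clone and a fresh copy of the dataset per run.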
Although of course given that we live in an age where people expect to upload their immense datasets to the cloud for some reason, a "proper" benchmark might not be a relevant one. I'm not sure what a really good benchmark of that would be.
Will add a local network benchmark as well! Many reasons to upload your data to the cloud...but agree that there are use cases where you might just want to version on your local network.
Oxen seems more like git (with Oxenhub as the GitHub equivalent) for ML datasets, whereas DVC is a bit more like make (with S3, LFS, etc. integration) for ML datasets. It seems like Oxen has finer-grained version control and diff capability, but as far as I can tell it doesn’t have as many features for tracking and versioning derived data along with the code that produced it (like `dvc repro`).
One thing I love about DVC is that it doesn't need its own server. I can just push/pull files via SSH. I don't really want one more service that I need to keep running. I also happen to have a lot of space available to me on a server I can't install extra services on, so oxen requiring that is a deal breaker for me.
This is the real deal breaker for me.
DVC is super slow, but it works with S3 (one of the greatest technologies built in the last 15 years). At our company, we've written our own (10x faster) version of DVC for the commonly used features.
Perhaps this is outside the scope of what Oxen aims to do, but I like that DVC has a way for me to specify scripts and dependencies and then decide what needs to be regenerated (and what doesn't) when dependencies change.
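To make the "decide what needs to be regenerated" part concrete, here is a rough sketch of the general idea: fingerprint each stage's dependencies and compare against what was recorded after the last run. This is a simplified illustration, not DVC's actual implementation; `file_hash`, `stale_stages`, `record`, and the shape of the `stages` and `lock` dicts are all made up for the example.

```python
import hashlib
from pathlib import Path


def file_hash(path):
    """Content fingerprint of one file."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def stale_stages(stages, lock):
    """Return names of stages whose dependency hashes differ from the lock.

    stages: {name: {"deps": [paths]}}          (hypothetical layout)
    lock:   {name: {path: hash}} recorded after the last successful run
    """
    stale = []
    for name, stage in stages.items():
        current = {dep: file_hash(dep) for dep in stage["deps"]}
        if lock.get(name) != current:
            stale.append(name)
    return stale


def record(stages, names, lock):
    """Store the current dependency hashes for stages that were just run."""
    for name in names:
        lock[name] = {dep: file_hash(dep) for dep in stages[name]["deps"]}
```

A stage with no lock entry counts as stale, so the first run regenerates everything. The sketch does not chain stages together; a real pipeline tool would also re-check downstream stages once an upstream output changes.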
Cool! To be honest I don’t really use DVC much, but the project version control features are what really interest me. I like how data pipelines help align versioned artifacts like model checkpoints and visualizations with the datasets and code that produced them. I work as a computational scientist, and that sort of reproducibility tool is really important, since a lot of us don’t have the best software engineering skills/discipline.
From your readme it seems like the Oxen repo and the software project repo are not as closely coupled as in DVC? It seems like, in the current state of Oxen, you could do something similar with Makefiles and Oxen tracking?
Oxen seems really good for longer lived data and computational science projects, where dvc seems more oriented just at analysis projects. I have a project that I want to try it out on :)
https://github.com/Oxen-AI/oxen-release/blob/main/Performanc...