by Datenstrom 2257 days ago
Have you looked into DVC[1] for versioning the data and the pipelines that generate it? I have set up a few versioned dataset repositories with it now and quite like it, especially the ability to simply `dvc import` the versioned data into projects and check out different versions for testing with various models.

It operates on data at the same level as git, but with the features needed for large datasets, and like git it is totally language- and framework-agnostic.

[1]: https://dvc.org/

1 comment

We looked into it, but it seems to be solving a different problem - how to handle large data. Does it solve merging of structured data?

E.g. a JSON file is changed on 2 machines, and you need to merge the changes. Sometimes you can merge (e.g. 2 different entries added to an array where people are adding annotations), sometimes you need to raise an error - e.g. changes to a single record, but in different fields - depending on the problem, you may disallow that to keep the record consistent.
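
To make the kind of merge I mean concrete, here is a rough sketch in Python. It is not something DVC or git provides; the names (`merge_records`, `MergeConflict`, `allow_field_merge`) are made up for illustration, and it assumes each array entry carries a stable id so the array can be loaded as a dict of id -> record, with a common base version available for a three-way comparison.

```python
class MergeConflict(Exception):
    """Raised when two edits cannot be reconciled under the chosen policy."""


_MISSING = object()  # sentinel for "field not present in this version"


def _merge_fields(rid, base, ours, theirs):
    """Field-by-field three-way merge of a single record (hypothetical helper)."""
    result = {}
    for key in sorted(set(base) | set(ours) | set(theirs)):
        b = base.get(key, _MISSING)
        o = ours.get(key, _MISSING)
        t = theirs.get(key, _MISSING)
        if o == t:        # both sides agree on this field
            value = o
        elif o == b:      # only theirs touched the field
            value = t
        elif t == b:      # only ours touched the field
            value = o
        else:             # same field edited differently on both sides
            raise MergeConflict(f"record {rid!r}: field {key!r} changed on both sides")
        if value is not _MISSING:
            result[key] = value
    return result


def merge_records(base, ours, theirs, allow_field_merge=False):
    """Three-way merge of {record_id: {field: value}} dicts.

    Records added or changed on only one side are taken as-is. A record
    changed on both sides is a conflict unless allow_field_merge is True,
    in which case edits to different fields are combined.
    """
    merged = {}
    for rid in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(rid), ours.get(rid), theirs.get(rid)
        if o == t:        # identical on both sides (or deleted on both)
            result = o
        elif o == b:      # only theirs added, changed, or deleted it
            result = t
        elif t == b:      # only ours added, changed, or deleted it
            result = o
        elif b is None or o is None or t is None:
            # added differently on both sides, or edited on one and deleted on the other
            raise MergeConflict(f"record {rid!r}: incompatible add/delete")
        elif not allow_field_merge:
            raise MergeConflict(f"record {rid!r} changed on both sides")
        else:
            result = _merge_fields(rid, b, o, t)
        if result is not None:
            merged[rid] = result
    return merged


if __name__ == "__main__":
    base = {"a1": {"label": "cat", "box": [0, 0, 10, 10]}}
    ours = {**base, "a2": {"label": "dog"}}      # machine 1 adds an annotation
    theirs = {**base, "a3": {"label": "bird"}}   # machine 2 adds another one
    print(merge_records(base, ours, theirs))
```

The strict default is the "raise an error" case: two people editing the same record is treated as a conflict even when they touched different fields, and you opt in to field-level merging per dataset with `allow_field_merge=True`.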