by Datenstrom
2257 days ago
Have you looked into DVC[1] for versioning the data and the pipelines that generate it? I have set up a few versioned dataset repositories with it now and quite like it, especially the ability to simply `dvc import` the versioned data into projects and check out different versions for testing with various models. It operates on data at the same level as git, but with the features needed for large datasets, and like git it is totally language and framework agnostic.
[1]: https://dvc.org/
|
E.g. a JSON file is changed on two machines, and you need to merge the changes. Sometimes you can merge automatically (e.g. two different entries added to an array where people are adding annotations); sometimes you need to raise an error. For example, if a single record is changed on both machines, even in different fields, you may want to disallow the merge, depending on the problem, to keep the record consistent.
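
To make that concrete, here is a rough sketch of the kind of merge I mean, not something DVC or git does for you. It assumes the JSON file is an array of records with a unique `id` field and that you still have the common ancestor (`base`) of both copies. The names `merge_records`, `MergeConflict` and `allow_field_merge` are made up for illustration.

```python
class MergeConflict(Exception):
    """Raised when both sides changed the same record (or the same field)."""


_MISSING = object()  # marks "record/field not present on this side"


def _pick(b, o, t):
    """Three-way choice for one value. Returns (resolved, value)."""
    if o == t:   # both sides agree (or neither changed)
        return True, o
    if o == b:   # only theirs changed
        return True, t
    if t == b:   # only ours changed
        return True, o
    return False, None  # both changed, differently


def _merge_fields(rid, b, o, t):
    """Field-by-field merge of one record; conflicts on a shared field raise."""
    out = {}
    for field in {**b, **o, **t}:  # union of field names, in first-seen order
        ok, value = _pick(b.get(field, _MISSING),
                          o.get(field, _MISSING),
                          t.get(field, _MISSING))
        if not ok:
            raise MergeConflict(f"record {rid!r}, field {field!r} changed on both sides")
        if value is not _MISSING:
            out[field] = value
    return out


def merge_records(base, ours, theirs, key="id", allow_field_merge=False):
    """Three-way merge of lists of dict records identified by `key`.

    Strict by default: if both sides changed the same record, raise, even
    when the changes touch different fields.
    """
    base_by = {r[key]: r for r in base}
    ours_by = {r[key]: r for r in ours}
    theirs_by = {r[key]: r for r in theirs}

    merged = []
    # Keep our ordering, then append records that exist only on their side.
    ids = list(ours_by) + [i for i in theirs_by if i not in ours_by]
    for rid in ids:
        b = base_by.get(rid, _MISSING)
        o = ours_by.get(rid, _MISSING)
        t = theirs_by.get(rid, _MISSING)
        ok, value = _pick(b, o, t)
        if not ok:
            present_everywhere = all(x is not _MISSING for x in (b, o, t))
            if not (allow_field_merge and present_everywhere):
                raise MergeConflict(f"record {rid!r} changed on both sides")
            value = _merge_fields(rid, b, o, t)
        if value is not _MISSING:  # _MISSING here means the record was deleted
            merged.append(value)
    return merged
```

You would load the three versions with `json.load` and pass them in. Two machines appending different annotations merge cleanly; two machines editing the same record raise `MergeConflict` unless you opt in with `allow_field_merge=True`, which is the "depends on the problem" part.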