by gschoeni 1141 days ago
We have been working on a data version control tool called Oxen that tackles many of the needs you describe. Feel free to check it out here:

https://github.com/Oxen-AI/oxen-release#-oxen

Going down your list of requirements, Oxen has:

* Data versioning, with a similar paradigm to Git, but built from the ground up for large ML datasets

* Inexpensive storage, with pricing comparable to S3

* Branching/Merging for maintaining production training data sets

* Metadata storage and query capabilities. It works with many structured data types, and we have APIs for querying.

* User interface for less tech-savvy people. We are building out a hub at https://www.oxen.ai to enable this.

* Being able to define datasets that are a subset of the whole collected data (is this a similar requirement to querying?)

* Data ingestion pipeline - right now, engineers would have to hook into the APIs or CLI tools.

Feel free to check it out and leave any feedback on the GitHub repo!