
by gschoeni 1188 days ago

We have been working on a data version control tool called Oxen that is tackling many of your needs. Feel free to check it out here:

Going down your list of requirements, Oxen has:

* Data versioning with a paradigm similar to Git, but built from the ground up for large ML datasets

* Inexpensive storage, with pricing comparable to S3

* Branching/merging for maintaining production training datasets

* Metadata storage and query capabilities that work with many structured data types. We have APIs for querying.

* A user interface for less tech-savvy people: we are building out a hub at https://www.oxen.ai to enable this.

* Being able to define datasets that are a subset of the whole collected data (is this a similar requirement to querying?)

* Data ingestion pipeline: for now, engineers would have to hook into the APIs or CLI tools.

Feel free to check it out and leave any feedback on the GitHub repo!