| We have been working on a data version control tool called Oxen that addresses many of your needs. Feel free to check it out here: https://github.com/Oxen-AI/oxen-release#-oxen Going down your list of requirements, Oxen has: * Data versioning: a similar paradigm to git, but built from the ground up for large ML datasets * Inexpensive storage: pricing comparable to S3 * Branching/merging for maintaining production training datasets * Metadata storage and query capabilities: works with many structured data types and has APIs for querying * A user interface for less tech-savvy people: we are building out a hub at https://www.oxen.ai to enable this * Being able to define datasets that are a subset of the whole collected data (is this a similar requirement to querying?) * Data ingestion pipeline: right now, engineers would have to hook into the APIs or CLI tools. Feel free to leave any feedback on the GitHub repo! |