| I am currently in charge of deciding on the tech stack for a large-scale AI project in the computer vision space. Most things are settled, but we expect to collect a LOT of data that will be labeled and/or auto-labeled (on the order of 100 million video clips). We will be training multiple models for different tasks from that data, and we need a good system to organize it. Does anybody have any tips or experience with that kind of thing?
We can use any on-premise or cloud solution. Specifically, we would need:
* Data ingestion pipeline (data will come from field personnel)
* Data versioning
* Being able to define datasets that are a subset of the whole collected data
* Inexpensive storage (e.g. S3 or similar)
* Branching/Merging for maintaining production training data sets
* Metadata storage and query capabilities ...
* User interface for less tech-savvy people (e.g. a git-like command line is fine for engineers, but not for field personnel who are not in IT)

I know of tools like https://dvc.org/, but they a) are just layers on top of git, b) break apart on huge datasets without a folder hierarchy (git tree objects just don't work for linear lists of items), c) are only usable by IT personnel, and d) require checking out at least a part of the dataset. Our datasets would be 100,000,000 x 100 MB = 10 PB of raw data. Training data should be delivered to training nodes via the network or similar means; we just can't have a full checkout of that data. |
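To make the scale concrete, here is a rough back-of-envelope sketch of the numbers above, using decimal units. The clip count and clip size come from the post. The per-clip manifest size is purely an assumption of mine, to illustrate how large even a flat index of the items would get.

```python
# Back-of-envelope sizing for the dataset described above (decimal units).
NUM_CLIPS = 100_000_000   # 100 million clips, as stated in the post
CLIP_SIZE_MB = 100        # approximate size per clip, as stated in the post

MB = 10**6
GB = 10**9
PB = 10**15

# Total raw storage: clips x bytes per clip.
total_bytes = NUM_CLIPS * CLIP_SIZE_MB * MB
total_pb = total_bytes / PB

# Hypothetical assumption: about 200 bytes of index data per clip
# (object key, content hash, size). This is not a measured value.
MANIFEST_BYTES_PER_CLIP = 200
manifest_gb = NUM_CLIPS * MANIFEST_BYTES_PER_CLIP / GB

print(f"raw data: {total_pb:.0f} PB")
print(f"flat index of all clips: {manifest_gb:.0f} GB")
```

Under that assumption, even the list of items alone would be in the tens of gigabytes, which is part of why a flat, git-style listing seems like a poor fit here.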