by KRAKRISMOTT 1202 days ago

Do you plan to add data management too? It's among the biggest features offered by your competitors, like Weights & Biases. Having a place to dump and load a few hundred gigabytes of data is very important because many on-demand cloud compute services don't offer persistence. Most ML training at scale doesn't use Colab notebooks beyond initial prototyping because it's too expensive. Dealing with a cluster of servers and running Jupyter on them is already annoying enough, so having data management abstracted away makes life a lot easier.

https://wandb.ai/site/artifacts

Make sure to talk to your users while building this. Some platforms didn't, for example:

https://docs.grid.ai/features/datastores

Grid/Lightning's data management is half-baked. They only allow mounting one set of data per instance, which is close to useless for any training beyond the simplest applications, because most data isn't nicely cleaned. You often have to bring together disparate sets of data for multi-modal applications.

2 comments

sourabh03agr 1202 days ago

Thanks for the question! Our initial focus is more on how to find the most relevant data points, out of hundreds of gigabytes of data, to retrain the model on. Our current data management strategy is pretty primitive: either local files, or we connect back to your data warehouse for persistence.

Soon, we plan to add data management features too, but primarily on the production side, so that data scientists can safely and securely version the data their AI application came across in production and use it to refine their model (if allowed).

vvipgupta 1202 days ago

Thanks for the suggestion and links. Completely agree: ML production data management can be painful, and an abstraction at the data layer would be a useful feature for supporting model refinement for users who operate at scale.
