by parnoux 1464 days ago
Thanks for the question and the papers. Like some of those companies, we believe in the data-centric approach to ML. But labeling is not our focus (unlike Snorkel, HumanLoop, Prodigy or Cord.Tech). We focus on diagnosing models and mining the best data to improve them. So there are more similarities with Aquarium or Lightly.

There is a great talk from Tesla [1] on what they call the "Data Engine" (that probably inspired some of us :)). One of the things we took from it was that in order to truly close the loop on the ML data flywheel, we needed to turn production into a reliable data source. It had to become accessible, understandable and minable. To achieve this we took the approach of combining ML observability with Active Learning mining frameworks. Combining both is important in our view because Observability tells you how the model behaves in the real world and Active Learning finds the right samples to fix / improve the model on real-world data. They go hand in hand.

Technically, it means that we integrate with serving and labeling platforms. We ingest data in both streaming and batch modes. We can mine on production streams, including on device (for IoT use cases where accessing data is a challenge). We have an extensive set of metrics to understand model behavior in the wild and solve use cases like data drift (detecting it, triggering mining and sending the data for labeling/retraining). And we are geared toward automation.
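
To make the "detect drift, then trigger mining" step concrete, here is a minimal sketch, not our actual implementation. It assumes drift is measured on a single numeric feature with a two-sample Kolmogorov-Smirnov statistic; the function names and the 0.3 threshold are made up for illustration.

```python
from bisect import bisect_right


def ks_statistic(reference, production):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    ref = sorted(reference)
    prod = sorted(production)
    max_gap = 0.0
    # The largest gap between two step functions occurs at an observed value.
    for x in ref + prod:
        cdf_ref = bisect_right(ref, x) / len(ref)
        cdf_prod = bisect_right(prod, x) / len(prod)
        max_gap = max(max_gap, abs(cdf_ref - cdf_prod))
    return max_gap


def should_trigger_mining(reference, production, threshold=0.3):
    """Hypothetical rule: start mining when drift exceeds an assumed threshold."""
    return ks_statistic(reference, production) > threshold


training_feature = [0.1, 0.2, 0.3, 0.4, 0.5]
serving_feature = [1.1, 1.2, 1.3, 1.4, 1.5]  # shifted distribution
if should_trigger_mining(training_feature, serving_feature):
    print("drift detected: mine samples and send them for labeling")
```

In practice you would run a check like this per feature (or on embeddings) over a sliding window of the production stream, and tune the threshold to your tolerance for false alarms.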

Regarding which data sampling works well and which doesn't, we found that there is no one-size-fits-all answer. Combining uncertainty sampling and diversity sampling is very powerful in a lot of use cases and can sometimes perform as well as random sampling with 10x less data. But model-based sampling strategies can also underperform on drifted datasets (essentially, a model can be very confidently wrong on a new kind of sample), hence the need to also have similarity sampling techniques.
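
For readers who want a picture of what combining the two can look like, here is a rough sketch under my own assumptions, not the product's algorithm. It ranks samples by prediction entropy (uncertainty), keeps a candidate pool, then greedily picks points that are far apart in embedding space (diversity). `select_samples` and `pool_factor` are hypothetical names.

```python
import math


def entropy(probs):
    """Shannon entropy of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_samples(probabilities, embeddings, budget, pool_factor=3):
    """Pick `budget` sample indices that are both uncertain and diverse."""
    # Step 1 (uncertainty): keep the most uncertain candidates.
    ranked = sorted(
        range(len(probabilities)),
        key=lambda i: entropy(probabilities[i]),
        reverse=True,
    )
    pool = ranked[: budget * pool_factor]
    if not pool:
        return []
    # Step 2 (diversity): greedy farthest-point selection inside the pool.
    selected = [pool[0]]
    while len(selected) < min(budget, len(pool)):
        remaining = [i for i in pool if i not in selected]
        best = max(
            remaining,
            key=lambda i: min(
                euclidean(embeddings[i], embeddings[j]) for j in selected
            ),
        )
        selected.append(best)
    return selected
```

Note that step 1 relies on the model's own confidence, so a confidently wrong prediction on drifted data never enters the pool. That is the failure mode described above, and the reason to pair this with a similarity-based strategy.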

Overall, we were able to show that we can intentionally drive specific model performance metrics, either globally or locally, by picking one technique vs another. Happy to share more if you want.

[1] https://www.youtube.com/watch?v=Ucp0TTmvqOE&t=7714s


>Technically, it means that we integrate with serving and labeling platforms. We ingest data both in streaming and batch. We can mine on production streams including on device (for iot use-cases where accessing data is a challenge).

Hold on, are you saying you look at production serving data... and are able to determine what problem in the training data caused it? That is pretty cool.

Yes, that's correct. We integrate with the major ML frameworks to monitor serving data and compare it to the training data to identify potential error patterns and mine the live stream for data to fix them. I'd love to show you the product in more detail and get your feedback if you're open to it!

Hi. So I don't use this kind of modeling today (I used to in my previous product).

But the reason I asked is that this is a fantastic feature and differentiator. I wonder why you don't put that claim up as the hero on your website? I don't see a reason why anyone would NOT use something like this.