by sriku 2167 days ago
Labels are knowledge about data. If you already know some rules that work reasonably well based on your domain experience, then Snorkel lets you capture those as "labeling functions", which may cover only part of the data or be "noisy". Snorkel can then build a model that labels your data while accounting for that "noise". Combining that with some "gold" labels (done by humans), you can use the generated labels on a large data set to build a higher quality model that generalizes better. This is similar to how you can take several low quality models and, because each has expertise over a different part of the data, build an "ensemble" model that performs better than any of them.
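To make the labeling-function idea concrete, here is a minimal sketch in plain Python. It is not Snorkel's API and not its label model: Snorkel estimates how accurate each labeling function is and weights votes accordingly, whereas this sketch swaps in a simple majority vote. The spam/ham task, the three rules, and every name below are made up for illustration.

```python
from collections import Counter

# Hypothetical label values for a toy spam task (assumption, not from Snorkel).
ABSTAIN, HAM, SPAM = -1, 0, 1

# Each labeling function encodes one noisy domain rule and may abstain.
def lf_contains_link(text):
    return SPAM if "http" in text.lower() else ABSTAIN

def lf_mentions_prize(text):
    return SPAM if "prize" in text.lower() else ABSTAIN

def lf_short_reply(text):
    return HAM if len(text.split()) <= 4 else ABSTAIN

LABELING_FUNCTIONS = [lf_contains_link, lf_mentions_prize, lf_short_reply]

def weak_label(text):
    """Combine the rules by majority vote; abstain on no votes or a tie."""
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return ABSTAIN  # the rules disagree evenly, so leave it unlabeled
    return counts[0][0]

# Existing, unlabeled data: the text is given, only the labels are generated.
unlabeled = [
    "Claim your prize at http://example.com now",
    "thanks, see you then",
    "The quarterly report is attached for your review today",
]
labels = [weak_label(t) for t in unlabeled]
```

The rows that come back as ABSTAIN are the ones no rule covers (or where rules conflict); those are natural candidates for the human "gold" labels, and the rest can feed a downstream model.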

IMHO, Snorkel-style tools ("weak supervision") are game changers for ML, though the biggies get all the press. So I'm excited to see the team take this end-to-end direction.


Hasn't this been done for years under names like synthetic data generation, simulation, etc.?
Not data generation. Label generation. But the charitable interpretation of your question is valid: we've been doing this kind of ensembling to make higher quality models for some time now. My feeling is that it's now getting some good structure, practice and tooling around it.
Yeah, then advertise it as a tool rather than AI. The problem is that Snorkel is trying to sell snake oil in the name of Stanford and AI. Under the hood it is just a data generation pipeline. Remember, you can't put a label on random data. So "Not data generation. Label generation" does not make any sense and sounds to me like "brown sugar".