| HN Mirror


by nshm 1465 days ago

For many domains, active learning is actually not that efficient. The promise is that you label only a subset of the data and train a model on it that reaches the same accuracy. The reality is that to estimate the long tail properly you need all the data points in the training set, not just a subset.

Consider a simple language model case. To learn specific phrases, the model needs to see them in training, and the phrases of interest are rare (usually 1-2 occurrences per terabyte of data). You simply cannot select just half of the data.
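
To put a rough number on the rare-phrase argument, here is a minimal sketch. It assumes each training example is kept independently with a fixed probability, which is a simplification: a real active-learning selector is not uniform. The function name `survival_probability` is made up for illustration.

```python
def survival_probability(occurrences: int, keep_fraction: float) -> float:
    """Chance that at least one copy of a phrase survives subsampling.

    Assumes each of the `occurrences` examples containing the phrase is kept
    independently with probability `keep_fraction` (uniform random selection).
    """
    return 1.0 - (1.0 - keep_fraction) ** occurrences


# A phrase seen once or twice in the whole corpus, keeping half of the data.
for k in (1, 2, 10):
    print(k, survival_probability(k, 0.5))
```

Under that assumption, a phrase that occurs once is lost half the time and a phrase that occurs twice is lost a quarter of the time. Only phrases with many occurrences reliably survive, and those are not the long tail.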

Semi-supervised learning and self-supervised learning are more reasonable and more widely used. You still use all the data for training; you just don't annotate it manually.

1 comment

parnoux 1465 days ago

You are right. Being able to learn good feature representations through SSL is very powerful. We leverage such representations to perform tasks like semantic search to tackle problems like long-tail sampling. We have seen pretty good results mining for edge cases. Let me know if you'd like to chat about it.
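
For readers unfamiliar with the idea, here is a rough sketch of semantic search over embeddings for edge-case mining. It is not the actual system described above: the vectors, the item names, and the helpers `cosine` and `nearest` are all hypothetical, and a real setup would take its embeddings from a self-supervised model rather than hand-written lists.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(query, pool, k=2):
    """Return the names of the k pool items most similar to the query."""
    ranked = sorted(pool.items(), key=lambda item: cosine(query, item[1]), reverse=True)
    return [name for name, _ in ranked[:k]]


# Toy embeddings standing in for the unlabeled pool (made-up values).
pool = {
    "img_001": [0.9, 0.1, 0.0],
    "img_002": [0.1, 0.9, 0.1],
    "img_003": [0.8, 0.2, 0.1],
    "img_004": [0.0, 0.1, 0.9],
}

# Embedding of one known edge case; look for similar unlabeled examples.
edge_case = [1.0, 0.0, 0.0]
print(nearest(edge_case, pool, k=2))
```

The point is that one known rare example can serve as a query to pull more of the same kind out of the unlabeled pool, so labeling effort goes to the tail instead of to a random sample.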
