by kkouddous 1460 days ago
We’ve been trying to implement an active-learning retraining loop for our critical NLP models for Koko, but we never found the time to prioritize the work because it was a multi-sprint effort. We’ve been working with them for a few weeks and we are seeing meaningful performance improvements in our models. I highly recommend trying them out.
1 comments

For many domains, active learning is actually not that efficient. The promise is that you label only a subset of the data and train a model on it that reaches the same accuracy. The reality is that to estimate the long tail properly you need all the data points in the training set, not just a subset.

Consider a simple language model case. To learn some specific phrases, the model needs to see them during training, and the phrases of interest are rare (usually 1-2 occurrences per terabyte of data). You simply cannot select just half of the data.
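
To put rough numbers on that, here is a minimal sketch. It assumes each example is kept independently with the same probability, which is a simplification (an active-learning selector is not a uniform sampler), and the function name `survival_probability` is made up for illustration.

```python
def survival_probability(occurrences: int, keep_fraction: float) -> float:
    """Probability that at least one of `occurrences` copies of a phrase
    ends up in the selected subset.

    Assumption: every example is kept independently with probability
    `keep_fraction` (uniform random selection, not a real selector).
    """
    if occurrences < 0:
        raise ValueError("occurrences must be non-negative")
    if not 0.0 <= keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in [0, 1]")
    # P(no copy is kept) = (1 - keep_fraction) ** occurrences
    return 1.0 - (1.0 - keep_fraction) ** occurrences


if __name__ == "__main__":
    for occurrences in (1, 2):
        for keep_fraction in (0.5, 0.1):
            p = survival_probability(occurrences, keep_fraction)
            print(f"occurrences={occurrences} keep={keep_fraction:.0%} P(seen)={p:.2f}")
```

Under this assumption, a phrase that occurs once survives a 50% cut only half of the time, and one that occurs twice survives three times out of four. A smarter selector can do better than uniform sampling, but only if it can recognize the rare phrase before it has been labeled.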

Semi-supervised and self-supervised learning are more reasonable and more widely used. You still use all the data for training; you just don't annotate it manually.

You are right. Being able to learn good feature representations through SSL is very powerful. We leverage such representations for tasks like semantic search to tackle problems like long-tail sampling. We have seen pretty good results mining for edge cases. Let me know if you'd like to chat about it.
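
For readers unfamiliar with the idea, semantic search for edge cases can be sketched roughly like this. This is not the actual pipeline described above, just a toy illustration: the vectors stand in for embeddings from some pretrained encoder, and `cosine` and `top_k_similar` are hypothetical helper names.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_similar(query: Sequence[float],
                  pool: Sequence[Sequence[float]],
                  k: int) -> List[int]:
    """Indices of the k pool vectors most similar to `query`, best first."""
    ranked = sorted(range(len(pool)),
                    key=lambda i: cosine(query, pool[i]),
                    reverse=True)
    return ranked[:k]


# Toy stand-ins for encoder outputs over an unlabeled pool (assumed values).
pool = [
    [1.0, 0.0, 0.0],
    [0.7, 0.7, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
# Embedding of one known edge case; its nearest neighbors in the pool
# become candidates to send for labeling.
edge_case = [1.0, 0.05, 0.0]
candidates = top_k_similar(edge_case, pool, k=2)
print(candidates)
```

Starting from one known rare example, the nearest neighbors in embedding space are the unlabeled items most likely to belong to the same tail, so they can be sent for labeling first. At real scale the brute-force sort would normally be replaced by an approximate nearest-neighbor index.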