| HN Mirror

For many domains active learning is not that efficient actually. The promise is that you make a subset of labels and train on them the model with the same accuracy. The reality is that in order to estimate long tail properly you need all the data points in the training set, not just a subset.

Consider simple language model case. In order to learn some specific phrases you need to see them in the training, and phrases of interest are rare (usually 1-2 cases per terabytes of data). You simply can not select a half.

A semi-supervised learning and self-supervised learning are more reasonable and widely used. You still consider all the data for training. You just don't annotate it manually.