I would agree: active learning is a neat idea, but while it gets you up the learning curve quicker, that does not necessarily translate into labeling less data in practice, for two reasons. First, a lot of AL papers use _simulation_ scenarios rather than production scenarios, i.e. more labeled training data already exists and is simply withheld from the learner. Obviously, if you already have that data, you have already paid to annotate it, so nothing was saved. Second, you always want to annotate more data as long as the learning curve isn't flat, so the question is not how quickly you climb it, but whether you should keep annotating or whether a flattening curve suggests you have reached the region of diminishing returns. There are also many sampling strategies, such as balancing exploration and exploitation, expected model change, expected error reduction, exponentiated gradient exploration, uncertainty sampling, query by committee, querying from diverse subspaces/partitions, variance reduction, conformal predictors, and mismatch-first farthest-traversal, and there is no theory that tells you which one to pick given what you know. (I've mostly heard of people in academia playing with uncertainty sampling or query by committee, but nobody I know in industry has told me they use AL.)
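For anyone who hasn't seen it, uncertainty sampling is the simplest of these to write down. Here is a minimal sketch, assuming you already have per-class predicted probabilities for an unlabeled pool. The names `uncertainty_sample` and `pool_probs` are made up for illustration, and entropy is only one of several common uncertainty scores:

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def uncertainty_sample(pool_probs, k):
    """Return indices of the k pool items with the highest predictive entropy.

    pool_probs is a hypothetical input: one list of class probabilities per
    unlabeled item, as produced by whatever model you are training.
    """
    ranked = sorted(range(len(pool_probs)),
                    key=lambda i: entropy(pool_probs[i]),
                    reverse=True)
    return ranked[:k]

# Toy pool of four unlabeled items from a binary classifier.
pool_probs = [
    [0.98, 0.02],  # confident prediction
    [0.55, 0.45],  # close to the decision boundary
    [0.80, 0.20],
    [0.50, 0.50],  # maximally uncertain
]
to_label = uncertainty_sample(pool_probs, k=2)
print(to_label)  # indices of the two most uncertain items
```

Note that this only ranks by the model's own confidence, so it inherits whatever miscalibration the model has.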
Active learning becomes really useful when you hit diminishing returns, since most real-world ML applications deal with long-tail distributions and random sampling doesn't pick out edge cases for labeling very well. An easy way to tell whether you're hitting diminishing returns is to run a data ablation: train the same model on subsets of your train set of increasing size, evaluate each on the same test set, and plot model performance against dataset size to see if you're starting to plateau. Or just eyeball your model's errors and see if there are any patterns in the edge cases it fails on.
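The subset experiment can be sketched in a few lines. This is only an illustration under loud assumptions: the data is synthetic 1-D Gaussian data, the "model" is a toy nearest-centroid classifier, and the helper names (`train_centroids`, `accuracy`, `learning_curve`) are made up; swap in your real training and evaluation code.

```python
import random

def train_centroids(rows):
    """Toy stand-in for a model: the mean feature value per class."""
    sums, counts = {}, {}
    for x, y in rows:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def accuracy(centroids, rows):
    """Fraction of rows whose nearest class centroid matches the label."""
    hits = 0
    for x, y in rows:
        predicted = min(centroids, key=lambda c: abs(x - centroids[c]))
        hits += predicted == y
    return hits / len(rows)

def learning_curve(train, test, sizes):
    """Train on the first n rows for each n, always scoring on the same test set."""
    return [(n, accuracy(train_centroids(train[:n]), test)) for n in sizes]

def make_rows(rng, n):
    """Synthetic data (assumption): class y is drawn from N(2*y, 1)."""
    rows = []
    for _ in range(n):
        y = rng.randint(0, 1)
        rows.append((rng.gauss(2.0 * y, 1.0), y))
    return rows

rng = random.Random(0)
train = make_rows(rng, 400)  # already in random order, so prefixes are random subsets
test = make_rows(rng, 200)

curve = learning_curve(train, test, sizes=(20, 40, 100, 200, 400))
for n, acc in curve:
    print(f"{n:4d} training rows -> test accuracy {acc:.3f}")

# Gain from the last doubling of the data; a value near zero hints at a plateau.
last_gain = curve[-1][1] - curve[-2][1]
```

With real data, shuffle before taking prefixes and repeat over a few random seeds, since a single run of a curve like this can be noisy.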
Lastly, I'm pretty skeptical of model-based uncertainty sampling. In industry, almost every active learning implementation boils down to answering "what data should we label next?", since model-based active learning is pretty hard to set up and confidence-based sampling is often not very reliable. That being said, I've anecdotally heard of some teams getting great performance from Bayesian methods once they have a large enough base dataset.
Anyway, here's my shameless plug for a post we wrote on the topic: https://medium.com/aquarium-learning/you-should-try-active-l...