
I think active learning has a time and a place. If you're starting a project from scratch, you probably don't need active learning, for the exact reasons you describe: as long as labeling randomly sampled data still gives you good improvements in model performance, you should scale out your labeling to get more data faster. For modern convnets fine-tuned on image data, I wouldn't think about active learning until you're past 10,000 examples.

Active learning becomes really useful once you hit diminishing returns. Most real-world ML applications deal with long-tail distributions, and random sampling doesn't do a good job of picking out edge cases for labeling. An easy way to tell if you're hitting diminishing returns is to run an ablation study: train the same model on different-sized subsets of your training set, evaluate each one on the same test set, and plot model performance against dataset size to see whether the curve is starting to plateau. Or just eyeball your model's errors and see if there are any patterns of edge cases it fails on.
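
To make the ablation idea concrete, here's a rough sketch of that learning-curve check. Everything in it is an assumption for illustration: `train` and `evaluate` are hypothetical callbacks you'd supply for your own model, and the fractions and the `min_gain` threshold are arbitrary placeholders, not recommended values.

```python
import random
from typing import Any, Callable, List, Sequence, Tuple


def learning_curve(
    train: Callable[[list], Any],          # hypothetical: fits a model on a list of examples
    evaluate: Callable[[Any, Any], float], # hypothetical: scores a model on the test set
    train_set: Sequence[Any],
    test_set: Any,
    fractions: Sequence[float] = (0.1, 0.25, 0.5, 1.0),
    seed: int = 0,
) -> List[Tuple[int, float]]:
    """Train on nested random subsets of train_set and score each on the same test_set."""
    shuffled = list(train_set)
    random.Random(seed).shuffle(shuffled)  # one shuffle, so smaller subsets nest inside larger ones
    curve = []
    for frac in fractions:
        n = max(1, int(len(shuffled) * frac))
        model = train(shuffled[:n])
        curve.append((n, evaluate(model, test_set)))
    return curve


def is_plateauing(curve: Sequence[Tuple[int, float]], min_gain: float = 0.01) -> bool:
    """True if the most recent increase in data improved the score by less than min_gain."""
    if len(curve) < 2:
        return False
    (_, previous_score), (_, last_score) = curve[-2], curve[-1]
    return (last_score - previous_score) < min_gain
```

In practice you'd plot the `(n, score)` pairs rather than rely on a single threshold, and you'd probably want to repeat each subset size with a few seeds, since one run per size can be noisy.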

Lastly, I'm pretty skeptical of model-based uncertainty sampling. In industry, almost every active learning implementation is closer to an informal "what data should we label next?" exercise than to anything model-driven, since model-based active learning is pretty hard to set up and confidence sampling is often not very reliable. That being said, I've anecdotally heard of some teams getting great performance from Bayesian methods once they have a large enough base dataset.
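
For reference, the simplest form of confidence sampling looks something like the sketch below (often called least-confidence sampling). The function name and the input format (one list of predicted class probabilities per unlabeled example) are my own assumptions, not anything from a specific library.

```python
from typing import List, Sequence


def least_confidence_ranking(probs: Sequence[Sequence[float]], k: int) -> List[int]:
    """Return the indices of the k examples whose top predicted class probability is lowest."""
    # The model's "confidence" in an example is the probability of its most likely class.
    confidences = [max(p) for p in probs]
    # Sort example indices from least to most confident and keep the first k.
    order = sorted(range(len(probs)), key=lambda i: confidences[i])
    return order[:k]


# Example: four unlabeled examples, two classes each.
probs = [[0.9, 0.1], [0.55, 0.45], [0.6, 0.4], [0.99, 0.01]]
to_label = least_confidence_ranking(probs, k=2)
```

The ranking is only as trustworthy as the probabilities the model produces, which is where the reliability concern above comes in.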

Anyway, here's my shameless plug for a post we wrote on the topic: https://medium.com/aquarium-learning/you-should-try-active-l...