by jll29 1965 days ago
I would agree - active learning is a neat idea, but while it gets you up the learning curve quicker, that does not necessarily correspond to saving data in practice, for two reasons.

First, a lot of AL papers use _simulation_ scenarios rather than production scenarios, i.e. more training data is already available and simply gets withheld. Obviously, if you already have more, you have already spent the effort annotating it, too, so there can't have been any saving.

Second, you always want to annotate more data than you have as long as the learning curve isn't flat. So it's not about how quickly you get up the curve; it's about whether you should keep annotating, or whether a flattening learning curve suggests you have reached the area of diminishing returns.

There are many sampling strategies, such as balancing exploration & exploitation, expected model change, expected error reduction, exponentiated gradient exploration, uncertainty sampling, query by committee, querying from diverse subspaces/partitions, variance reduction, conformal predictors or mismatch-first farthest-traversal, and there isn't a theory for picking the best one given what you know. (I've mostly heard of people playing with uncertainty sampling or query by committee in academia, but nobody in industry I know has told me they use AL.)
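To make two of the strategies named above concrete, here is a minimal sketch of uncertainty sampling (entropy-based) and query by committee (vote entropy). It assumes you already have predicted class probabilities, or hard votes from several models, for an unlabeled pool; the function names and toy numbers are made up for illustration and do not come from any particular library.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of one class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def uncertainty_sampling(pool_probs, k):
    """Indices of the k pool items whose predicted distribution has the highest entropy."""
    ranked = sorted(range(len(pool_probs)),
                    key=lambda i: entropy(pool_probs[i]), reverse=True)
    return ranked[:k]

def vote_entropy(votes, n_classes):
    """Committee disagreement: entropy of the distribution of hard votes."""
    counts = [votes.count(c) for c in range(n_classes)]
    return entropy([c / len(votes) for c in counts])

def query_by_committee(committee_votes, n_classes, k):
    """Indices of the k pool items the committee disagrees on most."""
    ranked = sorted(range(len(committee_votes)),
                    key=lambda i: vote_entropy(committee_votes[i], n_classes),
                    reverse=True)
    return ranked[:k]

# Toy pool (assumed values): one model's class probabilities for three items.
pool_probs = [[0.9, 0.1], [0.5, 0.5], [0.7, 0.3]]
picked_uncertain = uncertainty_sampling(pool_probs, k=2)

# Toy committee (assumed values): three models' hard votes for three items.
committee_votes = [[0, 0, 0], [0, 1, 1], [1, 1, 1]]
picked_qbc = query_by_committee(committee_votes, n_classes=2, k=1)
```

Both functions only rank the pool; which ranking actually helps on a given dataset is exactly the open question raised above.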

2 comments

I think active learning has a time and a place. If you're getting started with a project from scratch, you probably don't need active learning for the exact reasons you describe - as long as you still get good improvements to model performance by labeling randomly sampled data, then you should scale out your labeling to get more data faster. For modern convnets fine-tuned on image data, I don't think you should think about active learning until you're past 10,000 examples.

Active learning becomes really useful when you hit diminishing returns, as most real-world ML applications deal with long tail distributions, and random sampling doesn't pick out edge cases for labeling very well. An easy way to tell if you're encountering diminishing returns is to do an ablation study where you train the same model on different-sized subsets of your train set, evaluate each one on the same test set, and plot the curve of model performance vs dataset size to see if you're starting to plateau. Or just eyeball your model errors and see if there are any patterns of edge cases it fails on.
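The ablation described above could be sketched roughly as follows. This is only an illustration under stated assumptions: `train_and_score` is a hypothetical callable you would supply (train your model on the given subset, return its score on your fixed test set), `toy_score` is a synthetic saturating curve standing in for a real model, and the `min_gain` threshold is an arbitrary choice.

```python
import random

def learning_curve(train_set, fractions, train_and_score, seed=0):
    """Score the same model on nested random subsets of the train set.

    train_and_score(subset) -> float is supplied by the caller and must
    always evaluate on the same held-out test set.
    """
    rng = random.Random(seed)
    shuffled = list(train_set)
    rng.shuffle(shuffled)
    curve = []
    for frac in fractions:
        n = max(1, int(len(shuffled) * frac))
        curve.append((n, train_and_score(shuffled[:n])))
    return curve

def has_plateaued(curve, min_gain=0.01):
    """True if the last step up in dataset size gained less than min_gain."""
    if len(curve) < 2:
        return False
    return curve[-1][1] - curve[-2][1] < min_gain

# Stand-in for real training: a made-up score that saturates with dataset size.
def toy_score(subset):
    return 1 - 1 / (1 + len(subset)) ** 0.5

curve = learning_curve(list(range(1000)), [0.1, 0.25, 0.5, 1.0], toy_score)
sizes = [n for n, _ in curve]
```

Plotting `curve` (dataset size vs score) gives the picture to eyeball; `has_plateaued` is just a crude numeric version of "is the curve flattening", and the right threshold depends on how much a point of accuracy is worth to you.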

Lastly, I'm pretty skeptical of model-based uncertainty sampling. In industry, almost every active learning implementation boils down to a pragmatic "what data should we label next?" process, since model-based active learning is pretty hard to set up and confidence sampling is often not very reliable. That being said, I've anecdotally heard of some teams getting great performance from Bayesian methods once they have a large enough base dataset.

Anyway, here's my shameless plug for a post we wrote on the topic: https://medium.com/aquarium-learning/you-should-try-active-l...

There's an SV company, Citrine Informatics, that claims to use it.