by realradicalwash 1962 days ago
Nice to see some active learning around here. To add a data point from a less successful story:

In one of our research projects, we used AL to improve part-of-speech prediction, inspired by work by Rehbein and Ruppenhofer, e.g. https://www.aclweb.org/anthology/P17-1107/

Our data was a corpus of Scientific English from the 17th century to the present. For our data and situation, we found that choosing the right tool/model and having the right training data were the most important things. Once those were in place, active learning did not, unfortunately, add that much. Across different tools/settings, accuracy changed by about +/-0.2% for checking 200k tokens and correcting only 400 of them.

Maybe one problem was that AL was only triggered when a majority vote was inconclusive. Also, we used it on top of individualised gold-standard (gs) training data. I guess things can look different if you don't have a gs to start with, or if you have better computational resources: our oracles spent quite some time waiting, which is why we even reorganised the original design to process corrections in batches.
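For illustration, here is a minimal sketch of what "only triggered when a majority vote was inconclusive" could look like. It is not the project's actual code: the committee of three taggers, the tag names, the function names and the `min_share` threshold are all assumptions made up for the example.

```python
from collections import Counter


def majority_vote(votes, min_share=0.5):
    """Return (top_tag, is_conclusive) for one token's committee votes.

    Assumption: a vote counts as conclusive only if the top tag gets
    strictly more than `min_share` of all votes.
    """
    counts = Counter(votes)
    top_tag, top_count = counts.most_common(1)[0]
    return top_tag, (top_count / len(votes)) > min_share


def select_for_oracle(tokens, committee_predictions):
    """Collect (index, token, votes) for tokens the committee cannot settle."""
    queries = []
    for i, (token, votes) in enumerate(zip(tokens, committee_predictions)):
        _, conclusive = majority_vote(votes)
        if not conclusive:
            queries.append((i, token, votes))
    return queries


# Hypothetical example: one row of votes per token, one vote per tagger.
tokens = ["the", "light", "doth", "refract"]
committee_predictions = [
    ["DET", "DET", "DET"],
    ["NOUN", "ADJ", "VERB"],   # three-way split: goes to the oracle
    ["VERB", "VERB", "AUX"],
    ["VERB", "NOUN", "VERB"],
]
queries = select_for_oracle(tokens, committee_predictions)
```

With a trigger like this, anything that wins a bare majority is never shown to the oracle, which may be one reason so few tokens end up corrected. Collecting `queries` for a whole batch before asking anyone also matches the batch-style redesign mentioned above.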

As so often, those null results were hard to publish :|

Either way, I thought I'd share our experiences. Your work sounds really cool, best of luck!


I would agree: active learning is a neat idea, but while it gets you up the learning curve quicker, that does not necessarily correspond to saving annotation effort in practice, for two reasons.

First, a lot of the AL papers use _simulation_ scenarios rather than production scenarios, i.e. there is already more training data available, it just gets withheld. Obviously, if you already have more, you have already spent the effort annotating it, too, so there can't have been any saving.

Second, you always want to annotate more data as long as the learning curve isn't flat. So the question is not how quickly you get up the curve, but whether you should keep annotating or whether a flattening learning curve suggests you have reached the area of diminishing returns.

There are many sampling strategies, such as balancing exploration and exploitation, expected model change, expected error reduction, exponentiated gradient exploration, uncertainty sampling, query by committee, querying from diverse subspaces/partitions, variance reduction, conformal predictors, or mismatch-first farthest-traversal, and there isn't a theory for picking the best one given what you know. (I've mostly heard of people playing with uncertainty sampling or query by committee in academia, but nobody in industry I know has told me they use AL.)
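To make the two most commonly mentioned strategies concrete, here is a rough sketch of scoring an unlabeled pool with least-confidence uncertainty sampling and with vote entropy for query by committee. The function names and the toy probabilities are assumptions for illustration only; a real setup would take the probabilities or votes from your own model(s).

```python
import math


def least_confidence(probs):
    """Uncertainty sampling score: 1 minus the top class probability."""
    return 1.0 - max(probs)


def vote_entropy(votes):
    """Query-by-committee score: entropy of the committee's label votes."""
    n = len(votes)
    counts = {}
    for v in votes:
        counts[v] = counts.get(v, 0) + 1
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def top_k(scores, k):
    """Indices of the k highest-scoring pool items, most informative first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


# Hypothetical predicted class probabilities for three unlabeled items.
pool_probs = [
    [0.90, 0.05, 0.05],
    [0.40, 0.35, 0.25],
    [0.60, 0.30, 0.10],
]
to_label = top_k([least_confidence(p) for p in pool_probs], k=2)
```

Both scores only rank the pool; neither tells you which strategy will pay off on your data, which is the missing-theory problem described above.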

I think active learning has a time and a place. If you're getting started with a project from scratch, you probably don't need active learning for the exact reasons you describe - as long as you still get good improvements to model performance by labeling randomly sampled data, then you should scale out your labeling to get more data faster. For modern convnets fine-tuned on image data, I don't think you should think about active learning until you're past 10,000 examples.

Active learning becomes really useful when you hit diminishing returns, as most real-world ML applications deal with long-tail distributions, and random sampling doesn't pick out edge cases for labeling very well. An easy way to tell if you're encountering diminishing returns is to do an ablation study where you train the same model on different-sized subsets of your train set, evaluate each resulting model on the same test set, and plot the curve of model performance vs. dataset size to see if you're starting to plateau. Or just eyeball your model errors and see if there are any patterns of edge cases it fails on.
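A minimal sketch of that ablation, under some loud assumptions: `train_and_score` is a hypothetical callback you would supply (train on the subset, return a metric on the fixed test set), `toy_train_and_score` is a made-up saturating curve standing in for a real model, and the `min_gain` threshold is arbitrary.

```python
import random


def learning_curve(train_set, test_set, train_and_score, sizes, seed=0):
    """Score the same model recipe on nested random subsets of the train set."""
    rng = random.Random(seed)
    shuffled = list(train_set)
    rng.shuffle(shuffled)
    curve = []
    for n in sizes:
        subset = shuffled[:n]  # nested: each subset contains the smaller ones
        curve.append((n, train_and_score(subset, test_set)))
    return curve


def is_plateauing(curve, min_gain=0.005):
    """True if the last increase in data bought less than `min_gain`."""
    (_, prev_score), (_, last_score) = curve[-2], curve[-1]
    return (last_score - prev_score) < min_gain


def toy_train_and_score(subset, test_set):
    """Stand-in for real training: a made-up, saturating accuracy curve."""
    return 0.95 - 0.5 / (len(subset) ** 0.5)


train_set = list(range(10_000))  # placeholder examples
test_set = list(range(1_000))    # placeholder; always the same test set
curve = learning_curve(train_set, test_set, toy_train_and_score,
                       sizes=(1_000, 2_500, 5_000, 10_000))
plateau = is_plateauing(curve)
```

In practice you would plot `curve` and also repeat the run over a few seeds, since a single subset draw can make the last step look flatter or steeper than it really is.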

Lastly, I'm pretty skeptical of model-based uncertainty sampling. In industry, almost every active learning implementation is very "what data should we label next," since model-based active learning is pretty hard to set up and confidence sampling is often not very reliable. That being said, I've anecdotally heard of some teams getting great performance from Bayesian methods once you have a large enough base dataset.

Anyway, here's my shameless plug for a post we wrote on the topic: https://medium.com/aquarium-learning/you-should-try-active-l...

There's an SV company Citrine Informatics that claims to.