| Nice to see some active learning around here. To add a data point from a less successful story: in one of our research projects, we used AL to improve part-of-speech prediction, inspired by work by Rehbein and Ruppenhofer, e.g. https://www.aclweb.org/anthology/P17-1107/ Our data was a corpus of Scientific English from the 17th century to the present, and for our data and situation we found that choosing the right tool/model and having the right training data were the most important things. Once those were in place, active learning unfortunately did not add much. Across different tools/settings, accuracy changed by about +/-0.2% after checking 200k tokens and correcting only 400 of them. Maybe one problem was that AL was only triggered when a majority vote was inconclusive. Also, we used it on top of individualised gold-standard (gs) training data. I guess things can look different if you don't have a gs to start with, or if you have better computational resources: our oracles spent quite some time waiting, which is why we even reorganised the original design to process corrections in batches. As so often, those null results were hard to publish :| Either way, I thought I'd share our experiences. Your work sounds really cool, best of luck! |
First, a lot of the AL papers use _simulation_ scenarios rather than production scenarios, i.e. more training data is already available and simply gets withheld. Obviously, if you already have that data, you have also already paid for annotating it, so there can't have been any saving.
Second, as long as the learning curve isn't flat, you always want to annotate more data than you have. So the question is not how quickly the curve rises, but whether you should keep annotating or whether a flattening learning curve suggests you have reached the area of diminishing returns.
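To make the "keep annotating or stop" question concrete, here is a minimal sketch of one way to read a learning curve: compute the accuracy gained per 1,000 extra annotated tokens between checkpoints and compare the latest gain to a threshold. The function names, the threshold `min_gain_per_1k`, and all numbers are made up for illustration; none of them come from a real project.

```python
def marginal_gains(sizes, accuracies):
    """Accuracy gained per 1,000 extra annotated tokens between checkpoints."""
    gains = []
    for i in range(1, len(sizes)):
        extra_tokens = sizes[i] - sizes[i - 1]
        extra_accuracy = accuracies[i] - accuracies[i - 1]
        gains.append(extra_accuracy / extra_tokens * 1000)
    return gains


def should_keep_annotating(sizes, accuracies, min_gain_per_1k=0.0005):
    """True while the most recent gain is still above the (assumed) threshold."""
    return marginal_gains(sizes, accuracies)[-1] >= min_gain_per_1k


# Hypothetical checkpoints: training-set size in tokens and held-out accuracy.
sizes = [10_000, 20_000, 40_000, 80_000]
accuracies = [0.900, 0.930, 0.945, 0.947]

print(marginal_gains(sizes, accuracies))
print(should_keep_annotating(sizes, accuracies))
```

Where to put the threshold is a cost question (what is a point of accuracy worth versus an hour of annotator time), not something the curve itself tells you.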
There are many sampling strategies, e.g. balancing exploration & exploitation, expected model change, expected error reduction, exponentiated gradient exploration, uncertainty sampling, query by committee, querying from diverse subspaces/partitions, variance reduction, conformal predictors or mismatch-first farthest-traversal, and there is no theory for picking the best one given what you know. (I've mostly heard of people in academia playing with uncertainty sampling or query by committee, but nobody I know in industry has told me they use AL.)
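For readers who haven't seen the two strategies mentioned last, here is a rough sketch of how each one scores items. It uses least-confidence scoring as one common form of uncertainty sampling and vote entropy as one common form of query by committee. The toy probabilities, the POS-tag votes, and the helper names are all made up for illustration; this is not any particular library's API.

```python
import math
from collections import Counter


def least_confidence(probs):
    """Uncertainty sampling score: 1 minus the probability of the top label."""
    return 1.0 - max(probs)


def vote_entropy(votes):
    """Query-by-committee score: entropy of the committee's label votes."""
    counts = Counter(votes)
    total = len(votes)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def select_queries(scores, k):
    """Indices of the k highest-scoring items, i.e. the ones to send to the oracle."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Hypothetical model output: one probability distribution over tags per token.
token_probs = [[0.98, 0.01, 0.01], [0.40, 0.35, 0.25], [0.70, 0.20, 0.10]]
# Hypothetical committee of three taggers voting on the same three tokens.
committee_votes = [["NN", "NN", "NN"], ["NN", "VB", "JJ"], ["NN", "NN", "VB"]]

uncertainty_picks = select_queries([least_confidence(p) for p in token_probs], k=1)
committee_picks = select_queries([vote_entropy(v) for v in committee_votes], k=1)
print(uncertainty_picks, committee_picks)
```

In this toy case both strategies pick the second token, but on real data they can disagree, which is part of why choosing between them is hard.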