
You hit the nail on the head: for many straightforward information extraction problems, the universe of documents you want to extract from can be too small to learn a model from, or, for that matter, for conventional methods of model evaluation to apply. (You want to extract data from all the items, not prove you reached a certain level of accuracy on a sub-sample of them.)

One approach is

https://blog.openai.com/unsupervised-sentiment-neuron/

where you can throw in a large amount of unlabeled data and build an internal representation that models the data well enough that you can then train something that works like an HMM or CRF with only a tiny amount of labeled data.

If you are willing to do something rule-based, I've used

https://en.wikipedia.org/wiki/Case-based_reasoning

to organize the work of annotating corpora. Often I can prove that a certain rule set covers X% of the cases, then add a rule to get to (X+epsilon)%, and repeat until the results are "good enough".
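A minimal sketch of that coverage loop, to make it concrete. Everything here is made up for illustration (the sample documents, the regex patterns, and the helper names `make_regex_rule`, `apply_rules` and `coverage`); it only shows the bookkeeping of measuring coverage and adding rules, not a full case-based reasoning system.

```python
import re
from typing import Callable, Optional

# A rule takes a raw document and returns the extracted fields,
# or None if the rule does not apply to that document.
Rule = Callable[[str], Optional[dict]]


def make_regex_rule(pattern: str) -> Rule:
    """Build a rule from a regex with named groups (hypothetical helper)."""
    compiled = re.compile(pattern)

    def rule(text: str) -> Optional[dict]:
        match = compiled.search(text)
        return match.groupdict() if match else None

    return rule


def apply_rules(rules, docs):
    """Try rules in order; the first rule that matches a document wins."""
    covered = []    # (document, extracted fields) pairs
    uncovered = []  # documents that no rule handled yet
    for doc in docs:
        for rule in rules:
            result = rule(doc)
            if result is not None:
                covered.append((doc, result))
                break
        else:
            uncovered.append(doc)
    return covered, uncovered


def coverage(rules, docs) -> float:
    """Fraction of documents handled by at least one rule."""
    covered, _ = apply_rules(rules, docs)
    return len(covered) / len(docs)


# Toy corpus and rules, invented for this example.
DOCS = [
    "Price: $12.50",
    "Price: $3.00",
    "Cost 7.25 USD",
    "Total = 9 dollars",
]
RULES = [
    make_regex_rule(r"Price:\s*\$(?P<amount>\d+\.\d{2})"),
    make_regex_rule(r"Cost\s+(?P<amount>\d+\.\d{2})\s+USD"),
]

if __name__ == "__main__":
    print(coverage(RULES[:1], DOCS))  # coverage with only the first rule
    print(coverage(RULES, DOCS))      # coverage after adding the second rule
    print(apply_rules(RULES, DOCS)[1])  # what is still left to write rules for
```

The useful part in practice is the `uncovered` list: you look at what is left, write the next rule against those cases, and stop when the remaining cases are not worth another rule.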

Feel free to click on my profile link and send me a message if you want to chat more.