This might not be a task for ML, especially assuming the only option would be unsupervised ML.
I would suggest using an ontology, or rolling your own from the English Wikipedia database dump, as a basis for tokenizing the menu text, and going from there. What structured content exactly are you trying to extract?
What's the source format? If you're dealing with PDFs at least you have textual data, which could be matched against a recipe database. I haven't checked but services like Epicurious might offer an API for that.
In that case you wouldn't need ML at all; pattern matching combined with named entity recognition would probably do just fine.
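To make the pattern-matching idea concrete, here is a rough sketch in Python. Everything in it is an assumption for illustration: the tiny `GAZETTEER` dict stands in for a real ontology or recipe database, the `LINE_RE` pattern assumes each menu line ends with a price, and `parse_menu_line` is a made-up helper name. A dictionary lookup like this is only a crude stand-in for proper named entity recognition.

```python
import re

# Hypothetical mini-gazetteer. In practice this would be built from an
# ontology or a recipe database rather than written by hand.
GAZETTEER = {
    "caesar salad": "salad",
    "french onion soup": "soup",
    "tiramisu": "dessert",
}

# Assumed line layout: "<dish text> <optional dot leaders> <price>",
# e.g. "Caesar Salad .... $8.50". Real menus will need more patterns.
LINE_RE = re.compile(r"^(?P<name>.+?)[\s.]*\$?(?P<price>\d+(?:\.\d{2})?)\s*$")


def parse_menu_line(line):
    """Return a dict with name, price and gazetteer label, or None if no price is found."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    name = match.group("name").strip()
    lowered = name.lower()
    label = None
    # Try the longest phrases first so multi-word dish names win.
    for phrase in sorted(GAZETTEER, key=len, reverse=True):
        if phrase in lowered:
            label = GAZETTEER[phrase]
            break
    return {"name": name, "price": float(match.group("price")), "label": label}


if __name__ == "__main__":
    for raw in ["Caesar Salad .... $8.50", "Tiramisu 7", "Chef's special"]:
        print(raw, "->", parse_menu_line(raw))
```

Lines that come back as `None`, or with `label` set to `None`, are the ones that need a new pattern or a bigger gazetteer.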
I'm guessing he has a bunch of restaurant menus that he needs to read, categorizing the items into appetizers, entrees, desserts, etc. The problem is that he'll need 50,000 menus to train the ML model on, and then another 50,000 to verify it.
He'll also run into the problem of restaurants sometimes categorizing entrees as appetizers (is a Caesar salad an entree or an appetizer?), so the NLP portion will be especially difficult. What about dish names that are in other languages?
You hit the nail on the head; for many straight information extraction problems, the universe of documents you want to extract from could be too small to learn a model from -- or, for that matter, for conventional methods of model evaluation to apply. (You want to extract data from all the items, not prove you reached a certain level of accuracy on a sub-sample of them.)
where you can throw in a great amount of unlabeled data and build an internal representation that models the data well enough that you can train something that works like an HMM or CRF with a tiny amount of labeled data.
If you are willing to do something rule-based, I've used
to organize the work in annotating corpuses. Often I can prove that a certain rule set covers X% of the cases, then add a rule to do X+epsilon% until the results are "good enough".
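A minimal sketch of that "measure coverage, add a rule, measure again" loop, in Python. This is not the tool mentioned above, just an illustration: the `RULES` list, the example patterns, and the `label_item` / `coverage` helpers are all hypothetical.

```python
import re

# Hypothetical rule set: (label, pattern) pairs. Order matters, first match wins.
RULES = [
    ("dessert", re.compile(r"\b(cake|tiramisu|ice cream|sorbet)\b", re.I)),
    ("salad", re.compile(r"\bsalad\b", re.I)),
    ("soup", re.compile(r"\b(soup|bisque|chowder)\b", re.I)),
]


def label_item(text, rules):
    """Return the label of the first rule that matches, or None."""
    for label, pattern in rules:
        if pattern.search(text):
            return label
    return None


def coverage(items, rules):
    """Return (fraction of items some rule labels, list of items no rule labels)."""
    uncovered = [item for item in items if label_item(item, rules) is None]
    if not items:
        return 0.0, uncovered
    return (len(items) - len(uncovered)) / len(items), uncovered


if __name__ == "__main__":
    items = ["Caesar Salad", "Clam Chowder", "Chocolate Cake", "Pad Thai"]
    print(coverage(items, RULES))

    # Look at what is still uncovered, write a rule for it, and measure again.
    more_rules = RULES + [("noodles", re.compile(r"\b(pad thai|ramen)\b", re.I))]
    print(coverage(items, more_rules))
```

The uncovered list is the useful part: it tells you which rule to write next, and you stop when the remaining cases are rare enough to handle by hand.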
Feel free to click on my profile link and send me a message if you want to chat more.