Hacker News
Ask HN: How to extract structured content from unstructured menus using NLP and ML?
27 points by restapi 3196 days ago
4 comments

This might not be a task for ML, especially assuming the only option would be unsupervised ML.

I would suggest using an existing ontology, or rolling your own from the English Wikipedia database dump, as a basis for tokenizing the menu text, and going from there. What structured content exactly are you trying to extract?
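To make the tokenization idea concrete, here is a minimal sketch of greedy longest-match lookup against a term list. The `ONTOLOGY` dictionary, its categories, and the `tokenize` helper are all made up for illustration; a real term list would come from an actual ontology or a Wikipedia dump, and would need better normalization than lowercasing and comma stripping.

```python
# Hypothetical mini-ontology: term -> category. These entries are
# illustrative assumptions, not taken from any real ontology.
ONTOLOGY = {
    "caesar salad": "dish",
    "salad": "dish",
    "grilled": "preparation",
    "chicken": "ingredient",
    "parmesan": "ingredient",
}
MAX_TERM_WORDS = max(len(term.split()) for term in ONTOLOGY)


def tokenize(text):
    """Greedy longest-match lookup of ontology terms in menu text."""
    words = text.lower().replace(",", " ").split()
    tokens = []
    i = 0
    while i < len(words):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(MAX_TERM_WORDS, len(words) - i), 0, -1):
            phrase = " ".join(words[i:i + n])
            if phrase in ONTOLOGY:
                tokens.append((phrase, ONTOLOGY[phrase]))
                i += n
                break
        else:
            # No ontology term starts here; keep the word as unknown.
            tokens.append((words[i], None))
            i += 1
    return tokens


if __name__ == "__main__":
    print(tokenize("Caesar Salad with grilled chicken, parmesan"))
```

Matching the longest phrase first keeps "caesar salad" as one term instead of splitting off "salad", and words that fall outside the ontology stay in the output with a `None` category so they can be reviewed later.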

What's the source format? If you're dealing with PDFs, at least you have textual data, which could be matched against a recipe database. I haven't checked, but services like Epicurious might offer an API for that.

In that case you wouldn't need ML at all; pattern matching combined with named entity recognition would probably do just fine.
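As a rough sketch of the pattern-matching half only (the named entity recognition step is left out), the following assumes a menu where each section heading sits on its own line and each item line ends in a price. That layout, the three heading names, and the `parse_menu` helper are assumptions made for illustration; real menus will need more patterns.

```python
import re

# Assumed layout: a section heading on its own line, followed by item
# lines that end in a price, optionally preceded by dot leaders and "$".
SECTION_RE = re.compile(r"^(appetizers|entrees|desserts)\s*$", re.IGNORECASE)
ITEM_RE = re.compile(
    r"^(?P<name>.+?)\s*\.*\s*\$?(?P<price>\d+(?:\.\d{2})?)\s*$"
)


def parse_menu(text):
    """Return a list of {section, name, price} dicts from plain menu text."""
    section = None
    items = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        heading = SECTION_RE.match(line)
        if heading:
            section = heading.group(1).lower()
            continue
        item = ITEM_RE.match(line)
        if item:
            items.append({
                "section": section,
                "name": item.group("name"),
                "price": float(item.group("price")),
            })
        # Lines matching neither pattern are skipped in this sketch.
    return items


if __name__ == "__main__":
    sample = "APPETIZERS\nCaesar Salad ..... $8.50\n\nDesserts\nTiramisu 7\n"
    for row in parse_menu(sample):
        print(row)
```

Lines that match neither pattern are dropped silently here; in practice it would be worth logging them, since they show which patterns are still missing.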

Can you provide an example link to the data?
I'm guessing he has a bunch of restaurant menus that he needs to read, categorizing the items into appetizers, entrees, desserts, etc. The problem is that he'll need 50,000 menus to train the ML model on, and then another 50,000 to verify it.

He'll also run into the problem of restaurants sometimes categorizing entrees as appetizers (is a Caesar salad an entree or an appetizer?), so the NLP portion will be especially difficult. What about dish names that are in other languages?

I could go on... Tough problem!

You hit the nail on the head; for many straight information extraction problems, the universe of documents you want to extract from could be too small to learn a model from, or, for that matter, for conventional methods of model evaluation to apply. (You want to extract data from all the items, not prove you reached a certain level of accuracy on a sub-sample of them.)

One approach is

https://blog.openai.com/unsupervised-sentiment-neuron/

where you can throw in a great amount of unlabeled data and build an internal representation that models the data well enough that you can train something that works like an HMM or CRF with a tiny amount of labeled data.

If you are willing to do something rule-based, I've used

https://en.wikipedia.org/wiki/Case-based_reasoning

to organize the work of annotating corpora. Often I can prove that a certain rule set covers X% of the cases, then add a rule to reach X+epsilon%, and repeat until the results are "good enough".
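The coverage loop described above can be sketched roughly as follows. This only illustrates measuring what fraction of items a rule set labels and which items remain; it is not an implementation of case-based reasoning, and the rules, labels, and helper names are hypothetical.

```python
import re

# Hypothetical ordered rule set: (label, pattern). The first matching
# rule wins. New rules get appended as uncovered cases are reviewed.
RULES = [
    ("dessert", re.compile(r"\b(cake|ice cream|tiramisu)\b", re.IGNORECASE)),
    ("appetizer", re.compile(r"\b(soup|salad|wings)\b", re.IGNORECASE)),
]


def label(item, rules):
    """Return the label of the first rule that matches, or None."""
    for name, pattern in rules:
        if pattern.search(item):
            return name
    return None


def coverage(items, rules):
    """Return (fraction of items labeled, list of unlabeled items)."""
    uncovered = [item for item in items if label(item, rules) is None]
    return 1 - len(uncovered) / len(items), uncovered


if __name__ == "__main__":
    items = ["Caesar Salad", "Chocolate Cake", "Ribeye Steak", "Tomato Soup"]
    print(coverage(items, RULES))
    # Review the uncovered items, add a rule for them, and measure again.
    more_rules = RULES + [
        ("entree", re.compile(r"\b(steak|burger)\b", re.IGNORECASE)),
    ]
    print(coverage(items, more_rules))
```

The list of uncovered items is the useful output: it tells you which rule to write next, and the loop stops once the remaining cases are few enough to handle by hand.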

Feel free to click on my profile link and send me a message if you want to chat more.

I would recommend checking out Mechanical Turk.
Yes, humans have an advanced neural network that comes trained on all kinds of tasks.
If you deal with PDF documents, I might be able to help. Mail is in profile.