Hacker News
by Abundnce10 3030 days ago
I have a new project at work: I need to take in free-form text of recipe ingredients (e.g. "1/2 cup diced onions", "two potatoes, cut into 1-inch cubes", etc.) and build a program that identifies the ingredient (e.g. onion, potato), as well as the quantity (e.g. 0.5 cup, 2.0 units). Would machine learning be an applicable approach to solving this? Right now I'm just planning on using an NLP library to parse out the various parts of the ingredient text.
5 comments

I did the same a while back, and I suggest using an NLP library to extract parts of speech and parse trees, then building a quick-and-dirty solution on top. In my case the strong solution (took a week+) wasn't much better than the hacky manual one based on specific keywords like "teaspoon" and parts of speech/parse trees (took a few hours).
It's not very sexy, but I think you might find it easier and more robust just to use an NLP library.

I built something similar (albeit for a relatively limited database of recipes) for a hackathon a couple of weeks back. I didn't even use a proper NLP library, just some simple hand-rolled pattern-matching, and got pretty good results.

Good luck!

I think you're right. Did you happen to open-source your code from the hackathon? I'd love to take a look at your approach if you don't mind.
Sorry, I normally would but one of the other team members is considering taking the hack forward and wanted to keep it closed for now. (It's hard to see how much competitive advantage he'd have from 48 hours of very-hacked-together code, but so few hackathon projects get taken forward that I didn't want to discourage him!)

The approach was to tokenize the input and then do basic pattern-matching on it, with separate dictionaries of quantity units (e.g. cup, oz, pound), ingredients, processing words (e.g. "chopped") and throw-away words (e.g. "of"). In fact, possibly the most complicated part was parsing "2.5", "2 and a half" and "2½" all to the same thing.
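
For anyone who wants something concrete to start from, here is a rough Python sketch of that tokenize-then-look-up idea. It is not the hackathon code (that stayed closed); the dictionaries, function names and the default "unit" label are all made up for illustration, and the word lists are tiny on purpose.

```python
import re
from fractions import Fraction

# Hypothetical, deliberately tiny dictionaries; a real parser needs far more entries.
UNITS = {"cup": "cup", "cups": "cup", "oz": "oz", "pound": "pound", "pounds": "pound",
         "teaspoon": "teaspoon", "teaspoons": "teaspoon"}
INGREDIENTS = {"onion": "onion", "onions": "onion", "potato": "potato", "potatoes": "potato"}
PROCESSING = {"diced", "chopped", "cut"}
NUMBER_WORDS = {"one": Fraction(1), "two": Fraction(2), "three": Fraction(3),
                "half": Fraction(1, 2)}
UNICODE_FRACTIONS = {"½": Fraction(1, 2), "¼": Fraction(1, 4), "¾": Fraction(3, 4)}
CONNECTORS = {"and", "a"}  # glue words inside quantities such as "2 and a half"

# Fractions like 1/2, then integers/decimals, then plain words.
TOKEN_RE = re.compile(r"\d+/\d+|\d+(?:\.\d+)?|[a-z]+")


def tokenize(text):
    """Lowercase, rewrite glyphs like ½ as ' 1/2 ', and split into tokens."""
    text = text.lower()
    for glyph, value in UNICODE_FRACTIONS.items():
        text = text.replace(glyph, f" {value} ")
    return TOKEN_RE.findall(text)


def to_number(token):
    """Return the token as a Fraction, or None if it is not a number."""
    if token in NUMBER_WORDS:
        return NUMBER_WORDS[token]
    try:
        return Fraction(token)  # accepts "2", "2.5" and "1/2"
    except (ValueError, ZeroDivisionError):
        return None


def number_follows(tokens, i):
    """True if, after skipping connector words from position i, a number comes next."""
    while i < len(tokens) and tokens[i] in CONNECTORS:
        i += 1
    return i < len(tokens) and to_number(tokens[i]) is not None


def parse_quantity(tokens):
    """Sum the leading number tokens; return (quantity or None, remaining tokens)."""
    total, i, found = Fraction(0), 0, False
    while i < len(tokens):
        value = to_number(tokens[i])
        if value is not None:
            total, found, i = total + value, True, i + 1
        elif found and tokens[i] in CONNECTORS and number_follows(tokens, i):
            i += 1  # skip "and" / "a" inside a quantity
        else:
            break
    return (float(total) if found else None), tokens[i:]


def parse_ingredient(text):
    """Look each remaining token up in the dictionaries; anything unknown is thrown away."""
    quantity, rest = parse_quantity(tokenize(text))
    unit, ingredient, processing = None, None, []
    for tok in rest:
        if tok in UNITS and unit is None:
            unit = UNITS[tok]
        elif tok in INGREDIENTS and ingredient is None:
            ingredient = INGREDIENTS[tok]
        elif tok in PROCESSING:
            processing.append(tok)
    return {"quantity": quantity, "unit": unit or "unit",
            "ingredient": ingredient, "processing": processing}
```

Calling parse_ingredient('1/2 cup diced onions') gives a quantity of 0.5, unit "cup" and ingredient "onion". Only the leading numbers count as the quantity, so the "1" in "1-inch cubes" is ignored. The sketch simply adds adjacent numbers together, which is enough for "2 1/2" but would need more care for anything like ranges ("2 to 3 cups").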

Whether you end up using a machine learning approach or hand-crafting the solution, I recommend you work in an ML-like manner, dividing up the data you have into test and training sets and using cross-validation to evaluate your work.
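
A minimal sketch of what that could look like with only the standard library, assuming your data is a list of (ingredient text, expected label) pairs. The names k_fold_splits, cross_validate and fit are hypothetical: fit is whatever you use to turn training examples into a predict function. For a hand-written parser it can ignore the training data and just return the parser, in which case the held-out folds still tell you how it does on lines you did not look at while writing the rules.

```python
import random


def k_fold_splits(examples, k=5, seed=0):
    """Shuffle the examples once, then yield (train, test) pairs for k folds."""
    items = list(examples)
    if not 2 <= k <= len(items):
        raise ValueError("k must be between 2 and the number of examples")
    random.Random(seed).shuffle(items)
    folds = [items[i::k] for i in range(k)]  # every example lands in exactly one fold
    for i, test in enumerate(folds):
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        yield train, test


def cross_validate(examples, fit, k=5, seed=0):
    """Return one accuracy score per fold.

    `fit(train)` must return a function mapping an ingredient text to a label.
    """
    scores = []
    for train, test in k_fold_splits(examples, k, seed):
        predict = fit(train)
        correct = sum(predict(text) == label for text, label in test)
        scores.append(correct / len(test))
    return scores
```

Looking at the spread of the per-fold scores, not just the average, gives a feel for how much the result depends on which examples happened to be held out.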

For your actual question: yes, as others have said, it might be just an NLP/regexp problem. Otherwise, you could treat ingredient identification as a classification problem. I recommend checking out FastText and NLTK, and familiarizing yourself with the word dictionaries and pre-trained vectors that are available; these tools might help generalize your work beyond the data you have at hand.

(E.g. if it works well on your data using pre-trained word vectors from Wikipedia, chances are it might work on examples you don't even have.)

This is an NLP problem if all you're trying to do is extract nouns.