
by fallentimes 6645 days ago

Getting just the recipe would be the hardest part, but it's still doable. Once you figure out that you're currently parsing a recipe (via keywords, close matching, whatever), you could fan out and look for the common start/end tags around it, like <p>, <div>, etc. If you use something like Beautiful Soup you could do this pre-parse instead of post-parse and eliminate a lot of extra stuff up front (no recipes in the <head> tag, etc.)
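To make that first step concrete, here's a rough sketch. It uses the stdlib html.parser rather than Beautiful Soup just to stay dependency-free; the idea is the same. The keyword list, the tag sets, the `min_hits` threshold, and the names `BlockCollector` and `candidate_blocks` are all made up for illustration, so treat it as a starting point and not a finished extractor.

```python
from html.parser import HTMLParser

# Hypothetical keyword list; a real one would be much longer.
RECIPE_KEYWORDS = {"ingredients", "cup", "cups", "tbsp", "tsp",
                   "preheat", "bake", "stir"}
BLOCK_TAGS = {"p", "div", "li", "td", "blockquote"}  # start/end tags to fan out to
SKIP_TAGS = {"head", "script", "style"}              # no recipes in here


class BlockCollector(HTMLParser):
    """Collects the text found between block-level tag boundaries."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buf = []
        self._skip_depth = 0

    def _flush(self):
        # Collapse whitespace and store the chunk if anything is left.
        text = " ".join("".join(self._buf).split())
        if text:
            self.blocks.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self._flush()
        elif tag == "br":
            self._buf.append(" ")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._buf.append(data)

    def close(self):
        super().close()
        self._flush()


def candidate_blocks(html, min_hits=2):
    """Return the text blocks that contain at least min_hits recipe keywords."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    hits = []
    for block in parser.blocks:
        words = {w.strip(".,:;()").lower() for w in block.split()}
        if len(words & RECIPE_KEYWORDS) >= min_hits:
            hits.append(block)
    return hits
```

Feeding it a forum page should hand back only the chunks that look recipe-ish, with anything inside <head>, <script> or <style> dropped before the keyword check ever runs.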

After that it just becomes an issue of removing the cruft around the recipe. I would start with the common stuff: splitting things up by <br> or inner <p>, since if someone is gonna have something before / after their recipe (say, on a forum) it'll be split up with blank lines somehow (well, usually). This is another place to use things like close matching, and to teach the algorithm what it gets right/wrong so it can weight things as recipe / not recipe better in the future.
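Here's a small sketch of that cleanup pass too. I'm assuming "close matching" means something like difflib.get_close_matches from the Python stdlib, and the weight table, cutoff, threshold, learning rate, and function names are all invented for the example. The "teaching" part is just a crude weight nudge whenever a human label disagrees with the guess, nothing fancier.

```python
import re
from difflib import get_close_matches

# Hypothetical starting vocabulary; every term starts with the same weight.
weights = {"cup": 1.0, "teaspoon": 1.0, "tablespoon": 1.0,
           "preheat": 1.0, "bake": 1.0, "flour": 1.0}

# Split on <br>, on <p>/</p> tags, or on blank lines.
SPLIT_RE = re.compile(r"<br\s*/?>|</?p(?:\s[^>]*)?>|\n\s*\n", re.IGNORECASE)


def split_segments(fragment):
    """Break a chunk of HTML into whitespace-normalized segments."""
    parts = SPLIT_RE.split(fragment)
    return [" ".join(p.split()) for p in parts if p.strip()]


def matched_terms(segment, cutoff=0.8):
    """Vocabulary terms that closely match some word in the segment."""
    found = []
    for word in re.findall(r"[a-z]+", segment.lower()):
        match = get_close_matches(word, list(weights), n=1, cutoff=cutoff)
        if match:
            found.append(match[0])
    return found


def score(segment):
    return sum(weights[term] for term in matched_terms(segment))


def is_recipe(segment, threshold=1.5):
    return score(segment) >= threshold


def learn(segment, label, rate=0.2):
    """Nudge term weights when the guess disagrees with a human label."""
    if is_recipe(segment) == label:
        return  # guess was right, nothing to adjust
    delta = rate if label else -rate
    for term in set(matched_terms(segment)):
        weights[term] = max(0.0, weights[term] + delta)
```

Close matching is what lets a typo like "teaspon" or a plural like "cups" still count toward the score. Calling learn() with the correct label on a segment the scorer got wrong shifts the weights of the terms involved, so repeated corrections gradually change what gets kept.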

If you do all this and add more specific edge cases as time goes on, I think you'd be able to maintain a 50% correctness rate pretty easily.

Edit: And it'd be much cheaper than a neural network ;)