HN Mirror

by danohuiginn 6645 days ago

*nod* I've thought about this a fair amount, too. You can do a lot to, say, figure out which pages contain recipes, and even identify structured information like ingredient lists (they're just lists full of foodstuffs and quantities). But IME it all falls apart when you need to find a block of free text, like the descriptive part of the recipe. That's rarely marked up very clearly, and tends to blend into the rest of the page. So you either miss parts of the recipe, or pick up chunks of junk from the surrounding content.

That said, it's likely doable, as long as you don't need perfect results. There are plenty of sites around that seem to be doing things along these lines, but AFAIK none of them have open-sourced their code.

Meanwhile, I've been a coward and stuck to Beautiful Soup for my scraping projects. In the short term, it works out faster than trying to be too clever.
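
For what it's worth, here is a rough sketch of the "lists full of foodstuffs and quantities" heuristic mentioned above. It is only an illustration, not anyone's production code: the `UNITS` vocabulary, the 0.6 threshold, and all function names are made up for the example, and it uses the standard library's `html.parser` rather than Beautiful Soup so that it runs without extra dependencies. It also ignores the hard part (finding the free-text method section).

```python
import re
from html.parser import HTMLParser

# Hypothetical, deliberately tiny vocabulary; a real one would be far larger.
UNITS = {"cup", "cups", "tbsp", "tsp", "g", "kg", "ml", "l", "oz", "lb",
         "pinch", "clove", "cloves"}
# A leading number such as "2", "1.5" or "1/2".
QUANTITY = re.compile(r"^\s*\d+(?:[./]\d+)?")


class ListCollector(HTMLParser):
    """Collect the text of each <li>, grouped by its enclosing <ul>/<ol>."""

    def __init__(self):
        super().__init__()
        self.lists = []       # finished lists, each a list of item strings
        self._stack = []      # lists that are currently open
        self._in_item = False

    def handle_starttag(self, tag, attrs):
        if tag in ("ul", "ol"):
            self._stack.append([])
        elif tag == "li" and self._stack:
            self._stack[-1].append("")
            self._in_item = True

    def handle_endtag(self, tag):
        if tag in ("ul", "ol") and self._stack:
            self.lists.append(self._stack.pop())
            self._in_item = False
        elif tag == "li":
            self._in_item = False

    def handle_data(self, data):
        # Append text to the most recent item of the innermost open list.
        if self._in_item and self._stack and self._stack[-1]:
            self._stack[-1][-1] += data


def looks_like_ingredient(item):
    """True if the item starts with a number or mentions a known unit."""
    words = re.findall(r"[a-z]+", item.lower())
    return bool(QUANTITY.match(item)) or any(w in UNITS for w in words)


def ingredient_score(items):
    """Fraction of the items that look like ingredients (0.0 if empty)."""
    if not items:
        return 0.0
    return sum(looks_like_ingredient(i) for i in items) / len(items)


def find_ingredient_lists(html, threshold=0.6):
    """Return the lists whose ingredient score reaches the threshold."""
    parser = ListCollector()
    parser.feed(html)
    parser.close()
    cleaned = [[i.strip() for i in lst if i.strip()] for lst in parser.lists]
    return [lst for lst in cleaned if ingredient_score(lst) >= threshold]


if __name__ == "__main__":
    page = (
        "<ul><li>Home</li><li>About</li></ul>"
        "<ul><li>2 eggs</li><li>1/2 cup sugar</li><li>a pinch of salt</li></ul>"
    )
    for found in find_ingredient_lists(page):
        print(found)
```

Scoring whole lists rather than single items is what lets this tolerate the odd item that doesn't match, which fits the "as long as you don't need perfect results" caveat.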