by thorax 6647 days ago
I use BeautifulSoup when needed for simple scraping.

My biggest frustrations, right now, are really around getting data from lots of different websites in subtly varied forms. This is a tough problem to automate. I certainly haven't found any tools that make it simple.

I'd be happy with a 50% correctness rate, looking for very loose patterns. I just haven't found a tool and, while I have some ideas for how to do it, it's a major project in itself to produce something that can do this.

For example, imagine writing a scraper that would parse out every food recipe online, whether it appears in forums, blogs, or anywhere else. That's the sort of scraping I'm looking for, and the best approach I can think of is putting together a neural network or other system that I can train against human-provided data. Unfortunately, getting such a system to partition the text down to just the recipe would be difficult.

2 comments

Getting just the recipe would be the hardest part, but it's still doable. Once you figure out that you're currently parsing a recipe (via keywords, close matching, whatever) you could fan out and look for common start/end tags like <p>, <div>, etc. If you use something like Beautiful Soup you could do this filtering while the page is being parsed instead of after the whole tree is built, and eliminate a lot of extra stuff up front (no recipes in the <head> tag, etc.).
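A rough sketch of that first step, using only Python's standard library HTML parser rather than Beautiful Soup so it runs without installing anything. The keyword set, the tag sets, and the names `BlockCollector` and `recipe_blocks` are all made up for illustration, and the "at least two keyword hits" rule is an arbitrary assumption, not something tuned on real pages.

```python
from html.parser import HTMLParser

# Hypothetical keyword list; a real one would be far longer.
RECIPE_WORDS = {"cup", "cups", "tablespoon", "teaspoon",
                "preheat", "bake", "stir", "ingredients"}
BLOCK_TAGS = {"p", "div", "li", "td"}     # tags treated as block boundaries
SKIP_TAGS = {"head", "script", "style"}   # no recipes in here


class BlockCollector(HTMLParser):
    """Collect the text between block-level tags, ignoring <head> etc."""

    def __init__(self):
        super().__init__()
        self.blocks = []       # finished text blocks
        self._buf = []         # text pieces of the block being read
        self._skip_depth = 0   # > 0 while inside a skipped element

    def _flush(self):
        # Join the buffered pieces and normalise whitespace.
        text = " ".join(" ".join(self._buf).split())
        if text:
            self.blocks.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if not self._skip_depth:
            self._buf.append(data)

    def close(self):
        super().close()
        self._flush()


def recipe_blocks(html, min_hits=2):
    """Return the text blocks containing at least min_hits recipe keywords."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    hits = []
    for block in parser.blocks:
        words = {w.strip(".,:;()").lower() for w in block.split()}
        if len(words & RECIPE_WORDS) >= min_hits:
            hits.append(block)
    return hits
```

Calling `recipe_blocks(page_source)` gives back candidate blocks only. Nested <div>s get split at every boundary here, so a real version would probably need to merge neighbouring hits back together.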

After that it just becomes an issue of removing the cruft around the recipe. I would start with common stuff: splitting things up by <br> or inner <p> since if someone is gonna have something before / after their recipe (say, on a forum) it'll be split up with blank lines somehow (well, usually). This will be another time to use things like close matching and teaching the algorithm what it gets right/wrong so it can weigh things as recipe/not better in the future.
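Something like the following is what I have in mind for the cruft removal, as a sketch only. It splits a post on blank lines (or runs of <br>) and scores each piece with `difflib.get_close_matches` from the standard library, so misspellings still count. The vocabulary, the cutoff, the threshold, and the function names `recipe_score` and `strip_cruft` are assumptions for the example. The "teaching" part is left out, so the weights here are fixed.

```python
import difflib
import re

# Hypothetical vocabulary; in practice this would be learned or curated.
RECIPE_VOCAB = ["tablespoon", "teaspoon", "preheat", "ingredients",
                "simmer", "flour", "butter", "minutes", "oven"]


def recipe_score(segment, cutoff=0.8):
    """Fraction of words in the segment that closely match the vocabulary."""
    words = re.findall(r"[a-z]+", segment.lower())
    if not words:
        return 0.0
    matched = sum(
        1 for w in words
        if difflib.get_close_matches(w, RECIPE_VOCAB, n=1, cutoff=cutoff)
    )
    return matched / len(words)


def strip_cruft(post, threshold=0.2):
    """Split on blank lines or repeated <br> tags and keep recipe-like pieces."""
    segments = re.split(r"(?:\s*<br\s*/?>\s*){2,}|\n\s*\n", post)
    return [s.strip() for s in segments if recipe_score(s) >= threshold]
```

Feeding it a forum post with a greeting, a recipe paragraph, and a sign-off should keep just the middle piece, even if the poster wrote "tablespons" or "buter".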

If you do all this and add more specific edge cases as time goes on, I think you'd be able to maintain a 50% correctness rate pretty easily.

Edit: And it'd be much cheaper than a neural network ;)

*nod* I've thought about this a fair amount, too. You can do a lot to, say, figure out which pages contain recipes, even identify the structured information like ingredient lists (they're just lists full of foodstuffs and quantities). But IME it all falls apart when you need to find a block of text - like the descriptive part of the recipe. That's rarely marked up very clearly, and tends to blend into the rest of the text. So you either miss parts of the recipe, or pick up chunks of junk from the rest of the page.
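For the easy half (spotting ingredient lists), here is a minimal sketch of the sort of check I mean. It assumes you've already pulled the text of each list item out of the page. The unit list, the 60% threshold, and the name `looks_like_ingredient_list` are illustrative guesses, not anything battle-tested.

```python
import re

# A leading quantity: "2", "1.5", "1/2" or "1 1/2".
QUANTITY = r"(?:\d+\s+\d+/\d+|\d+(?:[./]\d+)?)"
# Hypothetical, deliberately short list of units.
UNIT = (r"(?:cups?|tablespoons?|tbsp|teaspoons?|tsp|ounces?|oz|"
        r"pounds?|lbs?|grams?|g|kg|ml|cloves?|pinch)")
INGREDIENT = re.compile(rf"^\s*{QUANTITY}\s*{UNIT}\b", re.IGNORECASE)


def looks_like_ingredient_list(items, min_fraction=0.6):
    """True if enough list items start with a quantity followed by a unit."""
    if not items:
        return False
    hits = sum(1 for item in items if INGREDIENT.match(item))
    return hits / len(items) >= min_fraction
```

Items like "3 eggs" have no unit and won't match, which is why the check only asks for a majority of the list rather than every line.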

That said, it's likely do-able, as long as you don't need perfect results. There are plenty of sites around that seem to be doing things along these lines - but AFAIK none of them have open-sourced their code.

Meanwhile, I've been a coward and stuck to Beautiful Soup for my scraping projects. In the short term, it works out faster than trying to be too clever.