by thorax
6647 days ago
I use BeautifulSoup when needed for simple scraping. My biggest frustration right now is getting data from lots of different websites that present it in subtly varied forms. This is a tough problem to automate, and I haven't found any tool that makes it simple. I'd be happy with a 50% correctness rate from very loose pattern matching. I have some ideas for how to do it, but producing something that works would be a major project in itself. For example, imagine writing a scraper that would parse out every food recipe online, whether it's in forums, blogs, etc. That's the sort of scraping I'm looking for, and the best approach I can think of is putting together a neural network or other system that I can train against human-provided data. Unfortunately, getting such a system to partition the text down to just the recipe would be difficult.
After that it just becomes an issue of removing the cruft around the recipe. I would start with the common stuff: splitting things up by <br> or inner <p>, since if someone has something before or after their recipe (say, on a forum) it'll usually be separated by blank lines somehow. This is another place to use things like close matching and teaching the algorithm what it gets right and wrong, so it can weigh things as recipe or not-recipe better in the future.
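A rough sketch of the splitting step, in case it helps. It uses only the standard library rather than BeautifulSoup to stay dependency-free, treats a run of two or more <br> tags as a "blank line", and scores each block against a made-up cue list. The names `RECIPE_CUES`, `split_blocks`, `cue_score`, and `likely_recipe_blocks`, and the 0.1 threshold, are all assumptions for illustration, not anything tuned on real pages.

```python
import html
import re

# Hypothetical cue words; a real list would have to be built from examples.
RECIPE_CUES = {"cup", "tsp", "tbsp", "teaspoon", "tablespoon",
               "oven", "bake", "stir", "ingredient"}

# Block boundaries: any <p> or </p> tag, or two or more <br> tags in a row.
BLOCK_BREAK = re.compile(r"(?:<br\s*/?>\s*){2,}|</?p\b[^>]*>", re.IGNORECASE)
ANY_TAG = re.compile(r"<[^>]+>")


def split_blocks(markup):
    """Split markup into plain-text blocks at paragraph-like boundaries."""
    blocks = []
    for chunk in BLOCK_BREAK.split(markup):
        # Drop remaining tags (including single <br>) and collapse whitespace.
        text = html.unescape(ANY_TAG.sub(" ", chunk))
        text = " ".join(text.split())
        if text:
            blocks.append(text)
    return blocks


def cue_score(block):
    """Fraction of the block's words that appear on the cue list."""
    words = re.findall(r"[a-z]+", block.lower())
    if not words:
        return 0.0
    hits = sum(
        1 for w in words
        if w in RECIPE_CUES or (w.endswith("s") and w[:-1] in RECIPE_CUES)
    )
    return hits / len(words)


def likely_recipe_blocks(markup, threshold=0.1):
    """Keep only the blocks whose cue score reaches the threshold."""
    return [b for b in split_blocks(markup) if cue_score(b) >= threshold]


post = ("<p>Hey all, long time lurker.</p>"
        "<p>Ingredients: 2 cups flour<br>1 tsp salt<br>"
        "Bake in the oven for 20 minutes.</p>"
        "Thanks!<br><br>Bob")
print(likely_recipe_blocks(post))
```

On a forum post like the sample, the greeting and sign-off score zero and drop out, which is the "removing the cruft" part. Loose matching like this will miss plenty, but the bar here is only around 50%.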
If you do all this and add more specific edge cases as time goes on, I think you'd be able to maintain a 50% correctness rate pretty easily.
Edit: And it'd be much cheaper than a neural network ;)
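For the "teaching the algorithm what it gets right/wrong" part, here is one cheap way it could look: per-word weights that get nudged whenever a human says a guess was wrong. This is just a perceptron-style sketch under my own assumptions; the class name `FeedbackScorer`, the step size, and the "score above zero means recipe" rule are all hypothetical.

```python
import re


class FeedbackScorer:
    """Per-word weights adjusted by human right/wrong labels."""

    def __init__(self, step=1.0):
        self.weights = {}  # word -> weight; positive leans "recipe"
        self.step = step

    @staticmethod
    def words(text):
        # Distinct lowercase words; repeats within a block count once.
        return set(re.findall(r"[a-z]+", text.lower()))

    def score(self, text):
        return sum(self.weights.get(w, 0.0) for w in self.words(text))

    def is_recipe(self, text):
        return self.score(text) > 0

    def feedback(self, text, is_recipe):
        """Adjust weights only when the current guess disagrees with the label."""
        if self.is_recipe(text) == is_recipe:
            return
        delta = self.step if is_recipe else -self.step
        for w in self.words(text):
            self.weights[w] = self.weights.get(w, 0.0) + delta


scorer = FeedbackScorer()
scorer.feedback("Mix 2 cups flour and bake", True)
scorer.feedback("Thanks for reading and see you", False)
print(scorer.is_recipe("Bake with flour"))     # True
print(scorer.is_recipe("Thanks for reading"))  # False
```

Common words like "and" end up near zero because they show up on both sides, while words like "flour" keep a positive weight. Every correction from a human is one more `feedback` call, so it can keep improving as edge cases turn up.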