by crazygringo
2105 days ago
I've long wanted a really robust way of defining page areas for scraping, one that could handle even relatively major HTML shifts. My best idea has been to maintain a collection of "reference" URLs (e.g. of different products or articles) and identify unique start/end text for those specific instances. Then automatically extract as many different "rules" as possible for locating the desired content (pure structure and ordering, class hierarchies, classes/ids, surrounding text, etc.) and keep the ones that are consistent across different instances. Then just use those rules until they break on the reference pages, and when they break, develop new ones. I'm curious if anyone's built this type of thing?
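To make the idea concrete, here is a rough sketch of the rule-extraction step, assuming each reference page is stored as HTML together with the exact text we expect to pull out of it. Everything in it is hypothetical and for illustration only: the three rule kinds (id, tag plus class, structural path), the function names, and the two toy pages. It uses only Python's standard-library HTML parser and is not how any particular tool does it.

```python
from html.parser import HTMLParser

VOID = {"br", "img", "hr", "meta", "link", "input"}  # tags with no closing tag


class Collector(HTMLParser):
    """Record every element with its id, classes, structural path and text."""

    def __init__(self):
        super().__init__()
        self.stack = []           # currently open elements
        self.elements = []        # all elements, in document order
        self.child_counts = [0]   # children seen so far at each open depth

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        attrs = dict(attrs)
        self.child_counts[-1] += 1
        parent_path = self.stack[-1]["path"] if self.stack else ()
        el = {
            "tag": tag,
            "id": attrs.get("id"),
            "classes": tuple((attrs.get("class") or "").split()),
            # path = (tag, position among siblings) for each ancestor and self
            "path": parent_path + ((tag, self.child_counts[-1]),),
            "text": "",
        }
        self.stack.append(el)
        self.elements.append(el)
        self.child_counts.append(0)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1]["tag"] == tag:
            self.stack.pop()
            self.child_counts.pop()

    def handle_data(self, data):
        for el in self.stack:     # text belongs to every open ancestor
            el["text"] += data


def parse(html):
    collector = Collector()
    collector.feed(html)
    collector.close()
    return collector.elements


def candidate_rules(elements, expected):
    """Propose every rule that points at an element holding the expected text."""
    rules = set()
    for el in elements:
        if el["text"].strip() == expected:
            if el["id"]:
                rules.add(("id", el["id"]))
            for cls in el["classes"]:
                rules.add(("class", el["tag"], cls))
            rules.add(("path", el["path"]))
    return rules


def matches(rule, el):
    if rule[0] == "id":
        return el["id"] == rule[1]
    if rule[0] == "class":
        return el["tag"] == rule[1] and rule[2] in el["classes"]
    return el["path"] == rule[1]


def apply_rule(rule, elements):
    """Return the text of the first element the rule matches, or None."""
    for el in elements:
        if matches(rule, el):
            return el["text"].strip()
    return None


def surviving_rules(references):
    """Keep only the rules that extract the expected text on every reference."""
    pages = [(parse(html), expected) for html, expected in references]
    candidates = set()
    for elements, expected in pages:
        candidates |= candidate_rules(elements, expected)
    return {r for r in candidates
            if all(apply_rule(r, els) == exp for els, exp in pages)}


PAGE_A = '<html><body><div class="promo">Sale</div><h1 class="title main">Red Mug</h1></body></html>'
PAGE_B = '<html><body><h1 class="title">Blue Pen</h1><div class="promo">Sale</div></body></html>'
REFERENCES = [(PAGE_A, "Red Mug"), (PAGE_B, "Blue Pen")]

print(surviving_rules(REFERENCES))  # {('class', 'h1', 'title')}
```

On these two toy pages the structural-path rules and the "main" class rule get thrown out because they only work on one page, and the h1/"title" class rule is the one left standing. Detecting breakage is then just re-running apply_rule on the reference pages and checking whether it still returns the expected text; if it doesn't, rerun surviving_rules against the fresh HTML.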
I'm having trouble finding those papers at the moment, but here are a couple of commercial products that sound similar in spirit to what you're describing.
https://scraper.ai/
https://www.diffbot.com/ (kind of)
Edit: I hadn't searched recently enough. See the sibling comment recommending this library. Haven't used it yet, but at first glance it looks nice. https://github.com/alirezamika/autoscraper/