by crazygringo 2105 days ago
I've long wanted a really robust way of defining page areas for scraping, that could handle even relatively major HTML shifts.

My best idea has been to simply maintain a collection of "reference" URLs (e.g. of different products or articles) and identify unique start/end text for those specific instances.

Then automatically extract as many different candidate "rules" as possible for locating the desired content (pure structure and ordering, class hierarchies, classes/ids, surrounding text, etc.) and find the ones that are consistent across different instances.

And then just use those rules until they break on the reference page... and when they break, develop new ones.
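
The loop described above can be sketched with the Python standard library alone. This is only an illustration under stated assumptions, not an existing tool: the rule kinds (tag path, id, tag plus class), the names `TextNodes`, `learn_rules` and `apply_rule`, and exact-text matching against the reference pages are all hypothetical choices made for the example.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they are kept off the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}


class TextNodes(HTMLParser):
    """Collect (text, candidate_rules) for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []  # open elements as (tag, attrs) pairs
        self.nodes = []  # (text, frozenset of rules) pairs

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.stack.append((tag, dict(attrs)))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag, tolerating sloppy HTML.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if not text or not self.stack:
            return
        tag, attrs = self.stack[-1]
        # Three hypothetical rule kinds: structure, id, and tag + class.
        rules = {("path", "/".join(t for t, _ in self.stack))}
        if attrs.get("id"):
            rules.add(("id", attrs["id"]))
        for cls in (attrs.get("class") or "").split():
            rules.add(("class", tag, cls))
        self.nodes.append((text, frozenset(rules)))


def parse(html):
    parser = TextNodes()
    parser.feed(html)
    parser.close()
    return parser.nodes


def apply_rule(rule, html):
    """Return the text of every node that the rule matches."""
    return [text for text, rules in parse(html) if rule in rules]


def learn_rules(references):
    """references: list of (html, wanted_text) pairs.

    Keep only the rules that select exactly the wanted text
    on every reference page.
    """
    consistent = None
    for html, wanted in references:
        nodes = parse(html)
        candidates = set()
        for text, rules in nodes:
            if text == wanted:
                candidates |= rules
        good = {r for r in candidates
                if [t for t, rs in nodes if r in rs] == [wanted]}
        consistent = good if consistent is None else consistent & good
    return consistent or set()
```

Re-running `apply_rule` for each learned rule against the reference pages is then the "use until they break" check: a rule that stops returning the known text is dropped, and `learn_rules` is run again on fresh copies of the pages.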

I'm curious if anyone's built this type of thing?


I've seen a few academic papers and a few closed products that convert your selection of content you care about into a scraper capable of acquiring that content in the future. Last I checked there wasn't anything readily available as a FOSS library for doing so.

I'm having trouble finding those papers at the moment, but here are a couple commercial products that sound similar in spirit to what you're describing.

https://scraper.ai/

https://www.diffbot.com/ (kind of)

Edit: I hadn't searched recently enough. See the sibling comment recommending this library. Haven't used it yet, but at first glance it looks nice. https://github.com/alirezamika/autoscraper/

I've tried autoscraper now, and I don't like it (at least not yet).

(1) Its wrapper generation code isn't much more advanced than assuming that similar data will be similarly nested in similar parent blocks. It looks more brittle than I'd like.

(2) It has zero tests, comments, docstrings, types, or any other niceties so far (and minimal documentation).

(3) When things go wrong it strongly prefers returning no information and not throwing any errors. None of the examples in the README actually run (or rather, they give you a `None` response that's all but useless) without changes.

Despite being undocumented, it really works well. I tried the README examples and they all work. Maybe you didn't update the wanted list in the examples because it has changed in the page. IMO the biggest problem is the lack of js enabled content support.

> it really works well

It works well enough. I tried a few other sites and had mixed results even when providing the raw HTML so that I knew its http logic wasn't the issue.

> maybe you didn't update the wanted list in the examples

Yeah, that was my only real problem with the readme examples. Those could just as easily be provided as local data (e.g. how `sklearn.datasets` works) so that the end user starts with working code, especially since there are no errors/warnings/etc when anything goes wrong.
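
A minimal sketch of both points, using only the standard library: the sample page lives next to the example so it cannot drift like a live site, and an empty result raises instead of passing silently. `SAMPLE_HTML`, `extract_titles` and `ExtractionError` are hypothetical names for illustration, not part of autoscraper or `sklearn.datasets`.

```python
import re

# Hypothetical sample page bundled with the example itself.
SAMPLE_HTML = """
<ul>
  <li class="story"><a href="/a">First story</a></li>
  <li class="story"><a href="/b">Second story</a></li>
</ul>
"""


class ExtractionError(ValueError):
    """Raised when an extraction rule matches nothing."""


def extract_titles(html):
    # Deliberately naive pattern; it stands in for whatever rule is in use.
    titles = re.findall(r'<li class="story"><a [^>]*>([^<]+)</a>', html)
    if not titles:
        raise ExtractionError(
            "rule matched nothing; the page layout may have changed")
    return titles
```

With this shape, `extract_titles(SAMPLE_HTML)` works out of the box, and a redesigned page produces an error rather than an empty result.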

> IMO the biggest problem is the lack of js enabled content support.

Haha, unless I'm seriously misunderstanding you this is one of the only things I don't mind :) Since you can pass raw html to the library, you can use your favorite headless browser to navigate (or in the happy case just load a non-interactive js-enabled site) to your content and pass it through to this library to do the data extraction. I rather like those features being decoupled and kind of wish this library didn't attempt to do any of the crawling itself. I know that's just a personal preference, but it's my account, so I'll say what I like about it.
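
The decoupling described above might look roughly like this. It is a sketch with hypothetical names (`fetch_file`, `extract_headings`, `scrape`), and the regex extractor is a toy stand-in for the real extraction library; the only point is that the extractor receives a string and never touches the network.

```python
import re
from pathlib import Path
from typing import Callable, List

Fetcher = Callable[[str], str]          # location -> raw HTML
Extractor = Callable[[str], List[str]]  # raw HTML -> extracted values


def fetch_file(path: str) -> str:
    """Read HTML already saved to disk, e.g. dumped by a headless browser."""
    return Path(path).read_text(encoding="utf-8")


def extract_headings(html: str) -> List[str]:
    """Toy extractor; it stands in for the learned scraper."""
    return [h.strip() for h in re.findall(r"<h1[^>]*>([^<]*)</h1>", html)]


def scrape(location: str, fetch: Fetcher, extract: Extractor) -> List[str]:
    # Fetching and extraction are independent, so either can be swapped out.
    return extract(fetch(location))
```

Swapping in a headless browser then only means writing another `Fetcher` that returns the rendered page source; the extraction side stays unchanged.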

Didn't see that feature, makes sense now :)

This project was recently submitted to r/python: https://github.com/alirezamika/autoscraper/