HN Mirror


by hansvm 2151 days ago

I've seen a few academic papers and a few closed products that convert your selection of content you care about into a scraper capable of acquiring that content in the future. Last I checked there wasn't anything readily available as a FOSS library for doing so.

I'm having trouble finding those papers at the moment, but here are a couple of commercial products that sound similar in spirit to what you're describing.

https://scraper.ai/

https://www.diffbot.com/ (kind of)

Edit: I hadn't searched recently enough. See the sibling comment recommending this library. Haven't used it yet, but at first glance it looks nice. https://github.com/alirezamika/autoscraper/


hansvm 2151 days ago

I've tried autoscraper now, and I don't like it (not yet).

(1) Its wrapper generation code isn't much more advanced than assuming that similar data will be similarly nested in similar parent blocks. It looks more brittle than I'd like.

(2) It has zero tests, comments, docstrings, types, or any other niceties so far (and minimal documentation).

(3) When things go wrong it strongly prefers returning no information and not throwing any errors. None of the examples in the README actually run (or rather, they give you a `None` response that's all but useless) without changes.
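
To make point (1) concrete, here is a toy sketch of the "similar data is similarly nested in similar parent blocks" heuristic. It is not autoscraper's actual code. It uses only the standard library, and every name in it (`PathCollector`, `learn_paths`, `extract`, the sample HTML) is made up for illustration. It learns the tag/class path of a wanted string and then returns every text node found under the same path.

```python
from html.parser import HTMLParser

# Tags that never get a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class PathCollector(HTMLParser):
    """Record (tag path, text) for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open (tag, class) pairs
        self.nodes = []  # collected (path tuple, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        cls = dict(attrs).get("class") or ""
        self.stack.append((tag, cls))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; tolerate sloppy HTML.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.nodes.append((tuple(self.stack), text))


def collect(html):
    parser = PathCollector()
    parser.feed(html)
    parser.close()
    return parser.nodes


def learn_paths(html, wanted):
    """Return the set of tag paths under which the wanted strings appear."""
    return {path for path, text in collect(html) if text in wanted}


def extract(html, paths):
    """Return every text node whose tag path was learned earlier."""
    return [text for path, text in collect(html) if path in paths]


TRAIN = """
<ul class="posts">
  <li class="post"><a class="title">First post</a><span class="score">10</span></li>
  <li class="post"><a class="title">Second post</a><span class="score">7</span></li>
</ul>
"""

# Same page after a small redesign: each title is now wrapped in an <h3>.
CHANGED = TRAIN.replace('<a class="title">', '<h3><a class="title">').replace(
    "</a>", "</a></h3>"
)

paths = learn_paths(TRAIN, {"First post"})
print(extract(TRAIN, paths))    # ['First post', 'Second post']
print(extract(CHANGED, paths))  # []
```

The second call shows the brittleness: one extra wrapper element changes the path, and the result is silently empty.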


junilop 2151 days ago

Despite being undocumented, it really works well. I tried the readme examples and they all work. Maybe you didn't update the wanted list in the examples, since the content of the page has changed. IMO the biggest problem is the lack of js enabled content support.


hansvm 2151 days ago

> it really works well

It works well enough. I tried a few other sites and had mixed results, even when providing the raw HTML so that I knew its HTTP logic wasn't the issue.

> maybe you didn't update the wanted list in the examples

Yeah, that was my only real problem with the readme examples. Those could just as easily be provided as local data (e.g. how `sklearn.datasets` works) so that the end user starts with working code, especially since there are no errors/warnings/etc when anything goes wrong.
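
A rough sketch of that idea, under stated assumptions: the example ships its own HTML, and a thin wrapper raises instead of handing back an empty result. `SAMPLE_HTML`, `NoMatchError`, and `build_strict` are hypothetical names, not part of autoscraper. The wrapper only assumes that the builder is called with `html=` and `wanted_list=` keyword arguments, which is how I understand `AutoScraper.build` to work.

```python
# Local sample page, so the example does not depend on a live site.
SAMPLE_HTML = """
<ul>
  <li><a class="title">First post</a></li>
  <li><a class="title">Second post</a></li>
</ul>
"""


class NoMatchError(ValueError):
    """Raised when the scraper learned nothing from the wanted list."""


def build_strict(build, html, wanted_list):
    """Call build(html=..., wanted_list=...) and fail loudly on an empty result."""
    result = build(html=html, wanted_list=wanted_list)
    if not result:
        raise NoMatchError(
            f"none of {wanted_list!r} could be located in the supplied HTML"
        )
    return result


# Intended use (assumption: autoscraper is installed and `build` takes these kwargs):
#
#   from autoscraper import AutoScraper
#   titles = build_strict(AutoScraper().build, SAMPLE_HTML, ["First post"])
```

With this shape, a stale wanted list produces a `NoMatchError` on the first run instead of a `None` that has to be debugged.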

> IMO the biggest problem is the lack of js enabled content support.

Haha, unless I'm seriously misunderstanding you, this is one of the only things I don't mind :) Since you can pass raw HTML to the library, you can use your favorite headless browser to navigate to your content (or, in the happy case, just load a non-interactive js-enabled site) and pass the result through to this library to do the data extraction. I rather like those features being decoupled and kind of wish this library didn't attempt to do any of the crawling itself. I know that's just a personal preference, but it's my account, so I'll say what I like about it.
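
A minimal sketch of that decoupling, assuming Playwright as the headless browser (any other would do). `fetch_static`, `fetch_rendered`, and `scrape` are illustrative names. The browser import is deferred so the block loads without Playwright installed, and the extractor is just a callable that takes HTML.

```python
from typing import Callable, List
from urllib.request import urlopen


def fetch_static(url: str) -> str:
    """Plain HTTP fetch for pages that do not need JavaScript."""
    with urlopen(url) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


def fetch_rendered(url: str) -> str:
    """Fetch a page after its JavaScript has run, via a headless browser."""
    # Imported lazily: only needed when this fetcher is actually used.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        try:
            page = browser.new_page()
            page.goto(url)
            return page.content()  # serialized DOM after rendering
        finally:
            browser.close()


def scrape(
    url: str,
    fetch: Callable[[str], str],
    extract: Callable[[str], List[str]],
) -> List[str]:
    """Crawling and extraction stay separate: fetch yields HTML, extract consumes it."""
    return extract(fetch(url))


# Intended use (assumption: `scraper` is an AutoScraper that has already been built,
# and get_result_similar accepts an `html=` keyword argument):
#
#   scrape(url, fetch_rendered, lambda html: scraper.get_result_similar(html=html))
```

Swapping `fetch_rendered` for `fetch_static`, or for a string read from disk, needs no change on the extraction side.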


junilop 2151 days ago

didn't see that feature, makes sense now :)
