by hansvm 2115 days ago
I've tried autoscraper now, and I don't like it (not yet).

(1) Its wrapper generation code isn't much more advanced than assuming that similar data will be similarly nested in similar parent blocks. It looks more brittle than I'd like.

(2) It has zero tests, comments, docstrings, types, or any other niceties so far (and minimal documentation).

(3) When things go wrong it strongly prefers returning no information and not throwing any errors. None of the examples in the README actually run (or rather, they give you a `None` response that's all but useless) without changes.
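
One way to work around the silent failures in (3) is to wrap the call yourself. This is a minimal sketch, assuming the library hands back `None` or an empty list when it finds nothing; `require_results` and `EmptyScrapeError` are hypothetical names, not part of autoscraper:

```python
from typing import Optional, Sequence


class EmptyScrapeError(ValueError):
    """Raised when a scrape produced nothing usable."""


def require_results(results: Optional[Sequence[str]], source: str) -> Sequence[str]:
    """Return results unchanged, or raise if the scraper came back empty-handed."""
    # Treat both None and an empty sequence as a failure worth reporting.
    if not results:
        raise EmptyScrapeError(
            f"no results extracted from {source!r}; "
            "the wanted list may no longer match the page"
        )
    return results
```

Passing whatever the scraper returns through `require_results(...)` turns a useless `None` into an exception that points at the likely cause.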

1 comments

Despite being undocumented, it really works well. I tried the README examples and they all work. Maybe you didn't update the wanted list in the examples after the page changed. IMO the biggest problem is the lack of support for JS-enabled content.

> it really works well

It works well enough. I tried a few other sites and had mixed results, even when providing the raw HTML so that I knew its HTTP logic wasn't the issue.

> maybe you didn't update the wanted list in the examples

Yeah, that was my only real problem with the readme examples. Those could just as easily be provided as local data (e.g. how `sklearn.datasets` works) so that the end user starts with working code, especially since there are no errors/warnings/etc when anything goes wrong.
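
A rough sketch of what that could look like, in the spirit of the `sklearn.datasets` loaders. The HTML snippet, the wanted list, and the `load_example` name are all made up for illustration:

```python
from typing import List, Tuple

# Hypothetical bundled page: a frozen snapshot, so it cannot drift
# the way a live site does.
_EXAMPLE_HTML = """\
<html><body>
  <ul class="questions">
    <li><a href="/q/1">How do I parse HTML in Python?</a></li>
    <li><a href="/q/2">What is a headless browser?</a></li>
  </ul>
</body></html>
"""

# Values known to appear verbatim in the snapshot above.
_EXAMPLE_WANTED = ["How do I parse HTML in Python?"]


def load_example() -> Tuple[str, List[str]]:
    """Return (html, wanted_list) that are guaranteed to match each other."""
    missing = [w for w in _EXAMPLE_WANTED if w not in _EXAMPLE_HTML]
    if missing:
        raise ValueError(f"example data is inconsistent, missing: {missing}")
    return _EXAMPLE_HTML, list(_EXAMPLE_WANTED)
```

With something like this, the first example a new user runs does not depend on a live page that may have changed since the README was written.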

> IMO the biggest problem is the lack of js enabled content support.

Haha, unless I'm seriously misunderstanding you, this is one of the only things I don't mind :) Since you can pass raw HTML to the library, you can use your favorite headless browser to navigate to your content (or, in the happy case, just load a non-interactive JS-enabled site) and pass the result through to this library to do the data extraction. I rather like those features being decoupled and kind of wish this library didn't attempt to do any of the crawling itself. I know that's just a personal preference, but it's my account, so I'll say what I like about it.
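
Roughly what I mean by decoupling, as a sketch. `scrape`, `fake_fetch`, and `extract_h2` are hypothetical stand-ins: in real use `fetch` would drive a headless browser and `extract` would hand the HTML to the extraction library, neither of which is shown here:

```python
from html.parser import HTMLParser
from typing import Callable, List

Fetcher = Callable[[str], str]          # URL -> rendered HTML
Extractor = Callable[[str], List[str]]  # HTML -> extracted values


def scrape(url: str, fetch: Fetcher, extract: Extractor) -> List[str]:
    """Glue only: fetching and extracting stay independently swappable."""
    html = fetch(url)
    return extract(html)


class _H2Collector(HTMLParser):
    """Toy extractor: collects the text inside every <h2> element."""

    def __init__(self) -> None:
        super().__init__()
        self._in_h2 = False
        self.titles: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._in_h2 = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_h2 = False

    def handle_data(self, data):
        if self._in_h2 and data.strip():
            self.titles.append(data.strip())


def fake_fetch(url: str) -> str:
    # Stand-in for a headless browser returning already-rendered HTML.
    return "<html><body><h2>First</h2><p>x</p><h2>Second</h2></body></html>"


def extract_h2(html: str) -> List[str]:
    parser = _H2Collector()
    parser.feed(html)
    parser.close()
    return parser.titles


titles = scrape("https://example.com", fake_fetch, extract_h2)
```

Because `scrape` only sees two callables, swapping the fetcher for a real browser or the extractor for a real library does not touch the other half.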

Didn't see that feature; makes sense now :)