by hansvm
2104 days ago
I've seen a few academic papers and a few closed products that convert your selection of content you care about into a scraper capable of acquiring that content in the future. Last I checked there wasn't anything readily available as a FOSS library for doing so.

I'm having trouble finding those papers at the moment, but here are a couple of commercial products that sound similar in spirit to what you're describing.

https://scraper.ai/

https://www.diffbot.com/ (kind of)

Edit: I hadn't searched recently enough. See the sibling comment recommending this library. Haven't used it yet, but at first glance it looks nice.

https://github.com/alirezamika/autoscraper/
(1) Its wrapper generation code isn't much more advanced than assuming that similar data will be similarly nested in similar parent blocks. It looks more brittle than I'd like.
(2) It has zero tests, comments, docstrings, types, or any other niceties so far (and minimal documentation).
(3) When things go wrong, it strongly prefers returning no information over throwing an error. None of the examples in the README actually run without changes (or rather, they give you a `None` response that's all but useless).
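
To make points (1) and (3) concrete, here is a rough sketch of the "similar data sits under similar parents" heuristic using only the Python standard library. This is not the library's actual code; `PathCollector`, `learn_path`, `extract`, and the sample `PAGE` markup are all names I made up for illustration. It learns the chain of (tag, class) pairs leading to one example value, then returns every text node sharing that exact chain, and it raises instead of returning `None` when the example can't be found.

```python
from html.parser import HTMLParser

# Tags that never get a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class PathCollector(HTMLParser):
    """Record (ancestor path, text) for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open (tag, class) pairs
        self.nodes = []   # collected (path, text) tuples

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append((tag, dict(attrs).get("class")))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray closers.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.nodes.append((tuple(self.stack), text))


def collect(html):
    parser = PathCollector()
    parser.feed(html)
    parser.close()
    return parser.nodes


def learn_path(html, wanted):
    """Return the ancestor path of the first text node equal to `wanted`."""
    for path, text in collect(html):
        if text == wanted:
            return path
    # Fail loudly rather than handing back None.
    raise LookupError(f"sample text {wanted!r} not found in page")


def extract(html, path):
    """Return every text node whose ancestor path matches exactly."""
    return [text for p, text in collect(html) if p == path]


# Hypothetical sample page, made up for this example.
PAGE = (
    '<ul>'
    '<li class="item"><span class="name">Apple</span><span class="price">$1</span></li>'
    '<li class="item"><span class="name">Pear</span><span class="price">$2</span></li>'
    '</ul>'
)

path = learn_path(PAGE, "Apple")
names = extract(PAGE, path)                                  # ['Apple', 'Pear']
after_redesign = extract("<div>" + PAGE + "</div>", path)    # []
```

The last line is the brittleness I mean: wrapping the same list in one extra `<div>` changes every ancestor path, so the learned rule silently matches nothing. A sturdier approach would presumably need fuzzier matching than exact path equality.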