by agumonkey 4464 days ago
It also depends on the site's HTML having a coherent structure.

Domains running websites that are more like JavaScript frontend modules shouldn't be scraped at all; it screams for a public API.

3 comments

"it screams for a public API"

But many content owners would never provide their data in this format, even if doing so would be trivial.

Try using https://snapsearch.io/. It is designed for JS sites.
These single-page sites do have a public, albeit undocumented, API. If you analyze the network requests via the dev tools in your browser, you'll find an XML/JSON data source that is probably structured better than the markup.
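A rough sketch of what that could look like once you've spotted the request in the network tab. The endpoint URL, the extra request headers, and the {"items": [{"title": ...}]} payload shape are all made up for illustration; a real site will use its own paths and field names.

```python
import json
import urllib.request

# Hypothetical endpoint, standing in for whatever URL shows up in the
# browser's network tab. Not a real API.
API_URL = "https://example.com/api/articles?page=1"


def fetch_json(url: str, timeout: float = 10.0) -> dict:
    """Request the same URL the page's JavaScript calls and decode the JSON body."""
    request = urllib.request.Request(
        url,
        # Assumed headers; some backends only answer with JSON when asked for it.
        headers={"Accept": "application/json", "X-Requested-With": "XMLHttpRequest"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Assumes the body is UTF-8 encoded JSON.
        return json.loads(response.read().decode("utf-8"))


def extract_titles(payload: dict) -> list[str]:
    """Collect titles from an assumed {"items": [{"title": ...}, ...]} payload."""
    return [item["title"] for item in payload.get("items", []) if "title" in item]


if __name__ == "__main__":
    print(extract_titles(fetch_json(API_URL)))
```

Since the response is already structured, there is no HTML parsing step at all, only picking fields out of a dict.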
Of course, I should have thought about it that way.