Hacker News
by ricardo81 2059 days ago
Done a fair bit of scraping in my time, mostly with PHP/curl and PHP's DOMDocument if necessary.

I'd say it's a good exercise for anyone learning to code. I think a scraper for most sites can be built in an hour or two, depending on the navigation and how data is sent to the client.

Definitely noticed a trend towards XHR calls that return JSON, typically keyed by a numeric ID. That's probably the easiest type of site to scrape: you don't need to crawl the navigation, you simply iterate over a number range, and the scraped data is already pretty much structured.
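To make the "iterate over a number range" idea concrete, here is a minimal Python sketch (not the PHP/curl setup mentioned above). The URL template, the `{id}` placeholder, the function names, and the assumption that a missing ID returns HTTP 404 are all hypothetical; a real site may signal gaps differently.

```python
import json
import time
import urllib.error
import urllib.request


def fetch_json(url, timeout=10):
    """Return the decoded JSON body for url, or None if the server answers 404."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return json.loads(resp.read().decode("utf-8"))
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None  # assumed: gaps in the ID range show up as 404
        raise


def scrape_id_range(url_template, start, stop, fetch=fetch_json, delay=1.0):
    """Collect records for IDs in [start, stop), skipping IDs with no data."""
    records = {}
    for item_id in range(start, stop):
        data = fetch(url_template.format(id=item_id))
        if data is not None:
            records[item_id] = data
        time.sleep(delay)  # pause between requests to go easy on the server
    return records
```

Something like `scrape_id_range("https://example.com/api/items/{id}", 1, 100)` would then return a dict keyed by ID. The `fetch` parameter is only there so the loop can be tried against canned data without touching the network.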

1 comments

Agreed, though I often find sites and pages that need Chrome's flavor of JS. Increasingly, you need Chrome/Chromium to reliably get the rendered markup.
I've never really scraped anything where the valuable data is in JS or dependent on a browser. Sometimes the browser uses JS to fetch the data, but the call is generally easy to find in your browser console, and the patterns are usually obvious.
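Once the call has been spotted in the browser, replaying it outside the browser might look roughly like this Python sketch. The header values are assumptions: `X-Requested-With: XMLHttpRequest` is a common convention rather than something every site checks, and the user agent string and function names are made up for illustration.

```python
import json
import urllib.request


def build_xhr_request(url, referer=None):
    """Build a GET request that resembles a browser's XHR call."""
    headers = {
        "Accept": "application/json",
        "X-Requested-With": "XMLHttpRequest",  # assumed: some backends look for this
        "User-Agent": "Mozilla/5.0 (compatible; example-scraper)",  # hypothetical UA
    }
    if referer:
        headers["Referer"] = referer
    return urllib.request.Request(url, headers=headers)


def replay_call(url, referer=None, timeout=10):
    """Send the request and decode the JSON response body."""
    request = build_xhr_request(url, referer)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

If the response comes back empty or as HTML, copying more of the headers the browser actually sent is usually the next thing to try.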