Hacker News
by zenocon 4679 days ago
I've done a considerable amount of scraping. If you're poking around at nicely designed web pages, node/cheerio will do nicely, but if you need to scrape data out of a DOM mess with quirks, iframes within iframes, and forms buried 6 posts deep (inside iframes with quirks), I'd use PhantomJS + CasperJS. Having a real browser sometimes makes a difference.
3 comments

PhantomJS + CasperJS is definitely the way to go when scraping data from complex pages. It's also great for circumventing bot detection. :)
I find Scrapy (Python) to be more robust for large-scale scraping. There are cases where you want or need the JavaScript to run, and that's when you need a real browser. Otherwise the rendering just slows things down.
Does this help with scraping websites that provide data via jQuery? I mean, does it render the JavaScript on the page?
Yes. It interprets and executes JS like a real browser would. Which is nice. For Python: http://jeanphix.me/Ghost.py/
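To make the static-parser vs. real-browser distinction concrete, here is a minimal sketch using only Python's standard-library `html.parser`. It is not how cheerio, Scrapy, or PhantomJS work internally. The two page snippets, the `price` class, the `/items.json` URL, and the `PriceExtractor` / `extract_prices` names are all made up for illustration. The point is only that a parser that does not execute JS sees whatever markup the server sent, and nothing a script would add later.

```python
from html.parser import HTMLParser


class PriceExtractor(HTMLParser):
    """Collects the text of every <span class="price"> in the raw markup."""

    def __init__(self):
        super().__init__()
        self._capturing = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "span" and "price" in classes:
            self._capturing = True
            self.prices.append("")

    def handle_endtag(self, tag):
        if tag == "span":
            self._capturing = False

    def handle_data(self, data):
        if self._capturing:
            self.prices[-1] += data


def extract_prices(html):
    """Parse the markup as-is; no JavaScript is executed."""
    parser = PriceExtractor()
    parser.feed(html)
    parser.close()
    return parser.prices


# Hypothetical page where the server already rendered the data into the HTML.
STATIC_PAGE = (
    '<ul><li><span class="price">9.99</span></li>'
    '<li><span class="price">4.50</span></li></ul>'
)

# Hypothetical page where jQuery would fetch and insert the data at runtime.
JS_PAGE = (
    '<ul id="items"></ul>'
    '<script>$.getJSON("/items.json", function (d) { /* fills #items */ });</script>'
)

print(extract_prices(STATIC_PAGE))  # data is in the markup
print(extract_prices(JS_PAGE))      # nothing to find until a browser runs the script
```

For the second kind of page you either need something that actually runs the script (the headless-browser route discussed above), or you skip rendering and request the underlying data endpoint directly, if there is one.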