Hacker News
Ask HN: Does anyone have experience scraping dynamic content?
3 points by sid_viswanathan 5051 days ago
http://www.nfl.com/scores

For example, for this URL, there is a section on the right for "Big Play Highlights" and the first one listed is "L.Brown 7-yard TD pass from..."

It looks like this data is being loaded via some kind of AJAX call. Do you have any ideas on how I can scrape this stream of highlights data? I've never tried to scrape any dynamic content in the past.

Ideas?

4 comments

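PhantomJS/CasperJS is what I've used for my current scraping projects. PhantomJS is a headless browser and CasperJS is a scripting layer on top of it, so together they imitate a real browser session. Just set a realistic browser user agent string (a Mozilla-style one, for example), and you're good to go.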
You MAY be able to do this with Selenium, although I've never used Selenium to scrape streaming requests. It's going to depend quite a bit on how the page is structured.

I have used Selenium to scrape dynamic content before by waiting for new DOM elements to be populated by AJAX, so I know this sort of thing is possible in a way you couldn't do with wget.
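
A minimal sketch of the "wait for new DOM elements" idea, in Python. It uses a hand-rolled polling loop instead of Selenium's built-in WebDriverWait so the waiting logic is visible; the only Selenium calls are `driver.get`, `driver.find_elements`, and `driver.quit`. The function names and the CSS selector in the usage note are hypothetical, not taken from the NFL page.

```python
import time


def wait_for_elements(driver, css_selector, timeout=10.0, poll=0.5):
    """Poll the page until at least one element matches css_selector.

    `driver` is any object exposing Selenium's find_elements(by, value).
    Raises TimeoutError if nothing shows up within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        # "css selector" is the string value of Selenium's By.CSS_SELECTOR.
        found = driver.find_elements("css selector", css_selector)
        if found:
            return found
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no element matched {css_selector!r}")
        time.sleep(poll)


def scrape_texts(url, css_selector):
    """Open url in Firefox and return the text of the AJAX-loaded elements."""
    from selenium import webdriver  # third-party: pip install selenium

    driver = webdriver.Firefox()
    try:
        driver.get(url)
        return [el.text for el in wait_for_elements(driver, css_selector)]
    finally:
        driver.quit()
```

Usage would look something like `scrape_texts("http://www.nfl.com/scores", ".highlight-item")`, where `.highlight-item` is a made-up selector you would replace after inspecting the page. In real code, Selenium's own `WebDriverWait` with `expected_conditions` does the same job.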

Yeah, just look at the HTTP requests the page makes. Here's the call: http://www.nfl.com/liveupdate/scores/bigPlayVideos.json?rand...

Just generate your own timestamp for the random parameter.
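
A rough sketch of that approach in Python, using only the standard library: skip the browser, request the JSON endpoint directly, and attach your own timestamp as the cache-busting value. The parameter name is cut off in the URL above, so `param` here is a placeholder you would fill in after checking the real request. The millisecond timestamp format and the User-Agent header are also assumptions.

```python
import json
import time
import urllib.request
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_cache_buster(url, param, now=None):
    """Return url with `param` set to a millisecond timestamp.

    `param` is the name of the random/cache-busting query parameter
    (hypothetical here; read the real one from the page's own request).
    `now` is seconds since the epoch; it defaults to the current time.
    """
    ts = int((time.time() if now is None else now) * 1000)
    parts = urlsplit(url)
    # Drop any existing value for the parameter, keep everything else.
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k != param]
    query.append((param, str(ts)))
    return urlunsplit(parts._replace(query=urlencode(query)))


def fetch_json(url, param, user_agent="Mozilla/5.0"):
    """Fetch and decode a JSON endpoint with a fresh cache-busting value."""
    request = urllib.request.Request(
        with_cache_buster(url, param),
        headers={"User-Agent": user_agent},  # assumption: a browser-like UA
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `fetch_json(endpoint_url, "cb")` in a loop with a sleep between requests would give you the stream of highlights, assuming the endpoint accepts a parameter named like `cb` (it may not) and still returns JSON.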

HtmlUnit has very good support for JavaScript. http://htmlunit.sourceforge.net/