by bdcravens 4683 days ago
I've done a lot of scraping. Some sites use heavy JavaScript frameworks that generate session IDs and request IDs, which the XHR requests then use to "authenticate" themselves. Reverse engineering that workflow takes a lot of work, so in these situations I lean on headless Selenium. I know there are some lighter solutions, but Selenium offers some distinct advantages:

1) a lot of library support, in multiple languages

2) without having to fake UAs, etc., the requests look more like a regular user's (all media assets downloaded, normal browser UA, etc.)

3) simple clustering: setting up a Selenium grid is very easy, and switching from a local instance of Selenium to the grid requires very little code change (1 line in most cases)
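To illustrate point 3, here is a rough sketch using the Python bindings, assuming Selenium 4 and headless Chrome. The helper names `driver_kind` and `make_driver` and the hub URL in the usage note are made up for the example; only the `webdriver.Chrome` and `webdriver.Remote` calls come from Selenium itself.

```python
def driver_kind(grid_url=None):
    """Return 'remote' when a grid URL is given, otherwise 'local'."""
    return "remote" if grid_url else "local"


def make_driver(grid_url=None):
    """Build a headless Chrome driver, either local or on a Selenium grid."""
    # Imported lazily so the module still loads where Selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    # Assumption: a recent Chrome; older builds take plain "--headless".
    options.add_argument("--headless=new")

    if driver_kind(grid_url) == "remote":
        # The one line that changes: talk to the grid hub instead of a local browser.
        return webdriver.Remote(command_executor=grid_url, options=options)
    return webdriver.Chrome(options=options)
```

With this shape, `make_driver()` runs locally and something like `make_driver("http://localhost:4444")` (a hypothetical hub address) runs on the grid; the scraping code that uses the driver should not need to change.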

1 comment

HtmlUnit† is also effective in such cases. HtmlUnit is intended to automate testing of websites. However, the very facilities that enable it to be useful for that purpose also make it useful for scraping.

A few years ago, I wanted to analyze retail store customer feedback data collected by a third-party company. The stores were franchises, and the third party was anointed by the franchising company. The data was presented to the user (franchisee store management) via a fancy web site with its own opinion about how the data should be analyzed. My opinion differed. I wanted the data in low-level, RDBMS-friendly form, so that I could recast it every which way (and come back later to recast it in some new way I had thought of). However, such was not forthcoming (big company, little franchisee).

The solution was to make a robot that put the third-party company's portal through its paces at the finest granularity, scraping the numbers into a DB as they tediously appeared. The robot was written in JRuby††, allowing access to HtmlUnit's functionality without the tedium of Java coding. It was slow, but I didn't care: I ran it overnight once a month, then ran reports off the resulting DB.

The coding approach was simple: Pretend you are a user. Access each page, starting with the login page, and do what the user would do. Scrape the interesting numbers as they appear. Append appropriate rows to the DB.
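The loop above could be sketched roughly as follows. This is not the original JRuby/HtmlUnit robot: it is a Python stand-in that assumes some `fetch_page` callable (hypothetical, wrapping a logged-in HtmlUnit or Selenium session) returns rendered HTML. The markup pattern, table name, and page ids are invented so the sketch runs offline against canned pages.

```python
import re
import sqlite3

# Hypothetical markup: each number is assumed to appear as
# <td class="metric" data-name="...">value</td>. A real robot would query the DOM instead.
METRIC_RE = re.compile(r'<td class="metric" data-name="([^"]+)">([-\d.]+)</td>')


def scrape_metrics(html):
    """Pull (name, value) pairs out of one rendered page."""
    return [(name, float(value)) for name, value in METRIC_RE.findall(html)]


def run_robot(fetch_page, page_ids, conn):
    """Visit each page in turn and append the scraped numbers to the DB."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS feedback (page TEXT, metric TEXT, value REAL)"
    )
    for page_id in page_ids:
        html = fetch_page(page_id)  # stands in for "do what the user would do"
        rows = [(page_id, name, value) for name, value in scrape_metrics(html)]
        conn.executemany("INSERT INTO feedback VALUES (?, ?, ?)", rows)
    conn.commit()


# Canned pages standing in for the real portal.
FAKE_PAGES = {
    "store-1": '<table><tr><td class="metric" data-name="satisfaction">4.2</td></tr></table>',
    "store-2": '<table><tr><td class="metric" data-name="satisfaction">3.9</td>'
               '<td class="metric" data-name="wait_minutes">7</td></tr></table>',
}

conn = sqlite3.connect(":memory:")
run_robot(FAKE_PAGES.__getitem__, ["store-1", "store-2"], conn)
rows = conn.execute("SELECT page, metric, value FROM feedback ORDER BY rowid").fetchall()
```

Keeping the rows this low-level (one row per page and metric) is what lets you recast the data later with plain SQL instead of re-scraping.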

†http://htmlunit.sourceforge.net/

††http://jruby.org/