| HN Mirror


by foodawg 6649 days ago

I do all my screen scraping with PHP, curl, and some regex. Previously I used plain PHP.
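A minimal sketch of that fetch-then-regex approach, written in Python rather than PHP purely for illustration. The `fetch` and `extract_titles` helpers, the User-Agent string, and the `<h2 class="title">` markup are all assumptions made for the example, not details of any real listing site.

```python
import re
import urllib.request

# Hypothetical pattern: assumes each listing title sits in <h2 class="title">...</h2>.
TITLE_RE = re.compile(r'<h2 class="title">\s*(.*?)\s*</h2>', re.IGNORECASE | re.DOTALL)


def fetch(url, timeout=10):
    """Download a page and return it as text (the curl step)."""
    req = urllib.request.Request(url, headers={"User-Agent": "example-scraper/0.1"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


def extract_titles(html):
    """Pull listing titles out of raw HTML with a regex (the regex step)."""
    # Collapse internal runs of whitespace so multi-line titles come out clean.
    return [re.sub(r"\s+", " ", title) for title in TITLE_RE.findall(html)]


if __name__ == "__main__":
    sample = '<h2 class="title">Evening News</h2><h2 class="title">Late\n Movie</h2>'
    print(extract_titles(sample))
```

Keeping the download and the extraction in separate functions means the regex can be tried against saved HTML without hitting the site each time. A regex like this breaks as soon as the markup changes, which is the usual trade-off of the approach.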

I use this setup to scrape television listing data (http://ktyp.com/rss/tv/ was my old site, and http://code.google.com/p/listocracy/) and, more recently, to scrape resume data from job posting websites for a (YC-rejected :P) side project I'm working on.

The hardest part of scraping, in my experience, is odd login and form setups. For example, Monster.com uses an external script to try to foil scrapers, and a couple of other sites redirect bizarrely across pages. Also AJAX certainly has changed the way a lot of screen scraping is done.
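One thing that often makes login forms awkward to script is hidden fields that have to be echoed back, plus cookies that must survive redirects. A rough sketch in Python, assuming a form whose credential inputs are named `username` and `password` (hypothetical names; a real site will differ, and this does not cover script-based defenses like the Monster.com case):

```python
import http.cookiejar
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urlencode


class FormFieldParser(HTMLParser):
    """Collect name/value pairs from every <input> on the page, hidden ones included."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        name = attr.get("name")
        if name:
            self.fields[name] = attr.get("value") or ""


def build_login_body(html, username, password):
    """Return a URL-encoded POST body that keeps the form's hidden fields."""
    parser = FormFieldParser()
    parser.feed(html)
    fields = parser.fields
    fields["username"] = username  # hypothetical field name
    fields["password"] = password  # hypothetical field name
    return urlencode(fields)


def make_session():
    """Build an opener that stores cookies; urllib follows redirects by default."""
    jar = http.cookiejar.CookieJar()
    return urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
```

The body from `build_login_body` would be encoded to bytes and passed as `data` to the opener's `open` call against the form's action URL. Watching a real login with a header-inspection tool shows which fields and cookies the site actually expects.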

Finally, the most useful tool I've used is LiveHTTPHeaders (http://livehttpheaders.mozdev.org/), which is great for following how a site operates.

Edit: For PHP, another interesting tool for scraping is htmlSQL (http://www.jonasjohn.de/lab/htmlsql.htm), which allows HTML to be searched using SQL-like syntax.

1 comments

m0nty 6649 days ago

"Also AJAX certainly has changed the way a lot of screen scraping is done."

I'd be interested in how you tackle this one. I've always used something like Perl, curl, or wget for scraping, but (like you say) JavaScript messes that up. I've had moderate success using Greasemonkey and regexps in JavaScript code, but it's a bit fragile. I'm thinking of using Greasemonkey + jQuery, since that should allow me to select DOM elements very easily. But if you have a better way, please share :)

alex_c 6649 days ago

Even though it's actually a testing tool, you might have some luck with Canoo WebTest + Groovy (http://webtest.canoo.com). WebTest uses HtmlUnit, which has pretty good JavaScript support, so you don't have to mess with regexps to get around the document structure. Groovy lets you use an actual programming language rather than WebTest's awkward Ant-based syntax. It takes some getting used to, and I haven't used it for web scraping, but it's a pretty powerful combination.

m0nty 6649 days ago

Thanks, I'll give it a try. I'm collaborating on a project which involves getting info from online financial markets, btw, but it's getting held up because of this scraping problem. So new ideas might help get it moving again.
