|
|
|
|
|
by foodawg
6649 days ago
|
|
I do all my screen scraping with PHP, curl, and some regex. Previously I used plain PHP. I use it to scrape television listing data (http://ktyp.com/rss/tv/ was my old site, and http://code.google.com/p/listocracy/) and more recently to scrape resume data from job posting websites for a (YC-rejected :P ) side project I'm working on. The hardest part I've encountered with scraping is odd login and form setups. For example Monster.com uses an outside script to attempt to fool scraping. A couple other sites use bizarre redirecting across pages. Also AJAX certainly has changed the way a lot of screen scraping is done. Finally, the most useful tool I've used is LiveHTTPHeaders (http://livehttpheaders.mozdev.org/) which is great for following how a site operates. Edit: For PHP, another interesting tool for scraping is htmlSQL (http://www.jonasjohn.de/lab/htmlsql.htm) which allows HTML to be searched using SQL like syntax. |
|
I'd be interested in how you tackle this one. I've always used something like Perl/Curl/wget etc for scraping, but (like you say) JavaScript messes that up. I've had moderate success using GreaseMonkey and regexps in JavaScript code, but it's a bit fragile. I'm thinking of using GreaseMonkey + jQuery, since that should allow me to select DOM elements very easily. But if you have a better way, please share :)