Theory-wise, there are regular languages, context-free grammars, and combinatory categorial grammars ( http://openccg.sf.net ). But regular + lists seems adequate for most tasks. What sorts of scraping do you find yourself doing? What are your biggest frustrations? What's the coolest hack you've encountered while scraping? My cofounder and I have been working on a domain-specific language to make scraping quick and easy, so that you can write, say, 100 different website scrapers in less time: http://dartbanks.com/simplescrape . We'd love feedback on this approach.
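To make the "regular + lists" point concrete, here is a minimal Python sketch: one regular expression describes a single record, and applying it across the page yields a list of records. The markup in `SAMPLE`, the `SHOW` pattern, and the `scrape_shows` name are all invented for illustration; this is not the DSL linked above.

```python
import re

# Hypothetical listing markup, made up for this example.
SAMPLE = """
<ul>
  <li class="show"><b>8:00</b> News Hour</li>
  <li class="show"><b>9:00</b> Late Movie</li>
</ul>
"""

# One regular pattern per record: a time in <b>, then the title text.
SHOW = re.compile(r'<li class="show"><b>(\d{1,2}:\d{2})</b>\s*([^<]+)</li>')

def scrape_shows(html):
    """Return a list of (time, title) tuples, one per matched record."""
    return [(time, title.strip()) for time, title in SHOW.findall(html)]

print(scrape_shows(SAMPLE))
```

This only holds up while each record is flat; once records nest, a regular pattern stops being enough and a real parser is the safer choice.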
I use it to scrape television listing data (http://ktyp.com/rss/tv/ was my old site, and http://code.google.com/p/listocracy/) and, more recently, to scrape resume data from job posting websites for a (YC-rejected :P ) side project I'm working on.
The hardest part I've encountered with scraping is odd login and form setups. For example, Monster.com uses an external script to try to thwart scrapers. A couple of other sites use bizarre redirects across pages. AJAX has also certainly changed the way a lot of screen scraping is done.
Finally, the most useful tool I've used is LiveHTTPHeaders (http://livehttpheaders.mozdev.org/), which is great for following how a site operates.
Edit: For PHP, another interesting tool for scraping is htmlSQL (http://www.jonasjohn.de/lab/htmlsql.htm), which allows HTML to be searched using SQL-like syntax.
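For anyone not on PHP, the query-style idea can be approximated with Python's standard library. The sketch below is roughly in the spirit of "select the href of every a tag"; it is not htmlSQL's API or syntax, and the `AttrSelect` class, the `select` function, and the sample markup are hypothetical names made up here.

```python
from html.parser import HTMLParser

class AttrSelect(HTMLParser):
    """Collect one attribute from every occurrence of one tag."""

    def __init__(self, tag, attr):
        super().__init__()
        self.tag = tag
        self.attr = attr
        self.rows = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before this call.
        if tag == self.tag:
            value = dict(attrs).get(self.attr)
            if value is not None:
                self.rows.append(value)

def select(attr, tag, html):
    """Return the values of `attr` for every `tag` element that has it."""
    parser = AttrSelect(tag, attr)
    parser.feed(html)
    parser.close()
    return parser.rows

# Hypothetical markup; the second link has no href and is skipped.
SAMPLE = '<p><a href="/jobs/1">One</a> <a name="x">skip</a> <A HREF="/jobs/2">Two</A></p>'
print(select("href", "a", SAMPLE))
```

A WHERE-style filter could be added by passing a predicate into `select`, but that is left out to keep the sketch short.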