Using Perl to scrape the web (ssscripting.blogspot.com)
27 points by geoscripting 6038 days ago
4 comments

Mechanize for Perl & Ruby is quite cool.

However, it's not able to execute JavaScript. The only library I found that handles a reasonable subset of JS is HttpUnit, in Java. It has a rather ugly interface IMO, but I use it with success. Driving it from a Clojure REPL makes it quite a handy tool for web scripting.
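
For readers unfamiliar with Mechanize, here is a minimal sketch of the Perl flavour (WWW::Mechanize): fetch a page, then read its title and links. The demo page, its link targets, and the use of a temporary local file in place of a real site are all assumptions made so that the sketch runs without network access; none of this comes from the linked article.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use URI::file;
use WWW::Mechanize;

# Hypothetical demo page, written to disk so no network access is needed.
my ($fh, $path) = tempfile(SUFFIX => '.html', UNLINK => 1);
print {$fh} <<'HTML';
<html><head><title>Demo page</title></head>
<body>
<a href="/news">News</a>
<a href="/jobs">Jobs</a>
</body></html>
HTML
close $fh;

# autocheck makes Mechanize die on a failed request instead of failing silently.
my $mech = WWW::Mechanize->new(autocheck => 1);
$mech->get(URI::file->new_abs($path));

# title() and links() work on the HTML that was just fetched.
my $title = $mech->title;
my @links = map { $_->text . ' -> ' . $_->url } $mech->links;

print "$title\n";
print "$_\n" for @links;
```

Against a real site you would pass an http(s) URL to get() instead of the file URL. As noted above, any JavaScript on the page is not executed.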

To use JavaScript, try the CPAN module WWW::Selenium (http://search.cpan.org/dist/Test-WWW-Selenium/).
Yeah, I considered it, but Selenium uses Firefox, doesn't it? And I needed a self-contained script without such dependencies.
No, Selenium works with all major browsers.
I used Selenium or WebDriver for that. Not bad.
Also available to JRuby as Celerity, which rocks.
Looks cool, thanks!
Perl would certainly be my language of choice for screen scraping. That said, some people see scraping as almost like stealing. I know of people who view Google negatively because of how it indexes news. I have skimmed a book (though I do not remember the name) that seemed to claim that Google profits only from the work of others. (I think they add value by aggregating information, though.) Nonetheless, you must be careful not to create a backlash (or a legal issue) with screen scraping. The irony is that one of the best targets for screen scraping for your own benefit may be Google itself... however, it almost seems like they encourage it. (They want you to use their APIs instead, though.)

Spidering has been around for a long time, yet people act like screen scraping is new. It is really the same thing that has existed for years. If you are going to do it, though, Perl is certainly the way to go. It is fast, efficient, and robust.

Theft requires that the taking deny the current owner access to, or use of, whatever was taken. Copyright infringement appears to be what you're referring to.

Google IMO is more of a symbiont than a parasite.

You are correct, and I agree. It is not theft, and Google is really a huge collection of mitochondria... helping the fledgling Internet become something more by combining it with an intellectual powerhouse. I could not have said it any better myself.
Perl has been the language of choice for web spidering since 1999, when it replaced REBOL for that purpose. Hint: the LWP module made such a dent in the industry that nobody was able to replace it until the last year or two, when other things started popping up.
This line seems problematic:

  use strict;
Most of the web is messy. Beautiful Soup and its ilk would seem like a better choice for parsing.
Ouch, should have RTFM. Thanks for the pointer.
strict is a standard Perl pragma; it enforces variable declarations and has nothing to do with how HTML is parsed. HTML::TreeBuilder seems to work just as well with malformed HTML.
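
Both points can be sketched in a few lines. The first part shows that `use strict` is about Perl variables: code that assigns to an undeclared variable fails to compile. The second part feeds HTML::TreeBuilder a deliberately broken fragment. The markup and the variable names are made up for illustration and are not taken from the article's script.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Part 1: strict rejects a variable that was never declared with `my`.
# The string eval fails to compile, so it returns undef and sets $@.
my $compiled = eval q{ use strict; $undeclared = 1; 1 };
my $strict_error = $@;
print "strict refused to compile the snippet\n" unless $compiled;

# Part 2: hypothetical malformed markup with unclosed <li> and <b>,
# and no <html> or <body> wrapper.
my $html = '<ul><li>First<li><b>Second</ul><p>Trailing text';

my $tree = HTML::TreeBuilder->new_from_content($html);

# look_down returns every matching element in document order.
my @items = map { $_->as_text } $tree->look_down(_tag => 'li');
print "$_\n" for @items;

$tree->delete;    # release the parse tree explicitly
```

The parser closes the dangling elements itself, so both list items come back as separate elements even though the source never closes them.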