Using Perl to scrape the web | HN Mirror


	Using Perl to scrape the web (ssscripting.blogspot.com)
	27 points by geoscripting 6085 days ago

4 comments

kunley 6085 days ago

Mechanize for Perl & Ruby is quite cool.

However, it's not able to execute JavaScript. The only lib I found that does so with a reasonable subset of JS is HttpUnit, in Java. It has a rather ugly interface IMO, but I use it with success. Driving it from a Clojure REPL makes it quite a handy tool for web scripting.
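
For readers who have not used it, here is a minimal sketch of Mechanize-style scripting in Perl with WWW::Mechanize. The helper name `list_links` is made up for illustration and is not taken from the linked article. As noted above, this only sees links present in the fetched HTML; nothing generated by JavaScript will show up.

```perl
use strict;
use warnings;
use WWW::Mechanize;

# Hypothetical helper: fetch a page and return the absolute URL
# of every link found in its HTML.
sub list_links {
    my ($url) = @_;

    # autocheck => 1 makes failed requests die instead of failing silently.
    my $mech = WWW::Mechanize->new( autocheck => 1 );
    $mech->get($url);

    # links() returns WWW::Mechanize::Link objects; url_abs() resolves
    # each href against the page's base URL.
    return map { $_->url_abs->as_string } $mech->links;
}

# Only fetch something when a URL is given on the command line.
if (@ARGV) {
    print "$_\n" for list_links( $ARGV[0] );
}
```

Saved as a script, this could be run with a URL as its only argument to print one link per line.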

draegtun 6085 days ago

If you need JavaScript, try the CPAN module WWW::Selenium (http://search.cpan.org/dist/Test-WWW-Selenium/).

kunley 6085 days ago

Yeah, I considered it, but Selenium uses Firefox, doesn't it? And I needed a self-contained script without such dependencies.

draegtun 6069 days ago

No, Selenium works with all major browsers.

geoscripting 6085 days ago

I used Selenium or WebDriver for that. Not bad.

regularfry 6085 days ago

Also available to JRuby as Celerity, which rocks.

kunley 6085 days ago

Looks cool, thanks!

Freebytes 6085 days ago

Perl would certainly be my language of choice for screen scraping. That said, some people see scraping as almost like stealing. I know of people who look at Google negatively for its news indexing method. I have skimmed through a book (though I do not remember the name) that seemed to claim that Google profits only from the work of others. (I think they added value in their aggregation of information, though.) Nonetheless, you must be careful not to create a backlash (or a legal issue) with screen scraping. The irony is that one of the best targets for screen scraping content for your own benefit may be Google itself... and it almost seems like they encourage it. (They want you to use their APIs instead, though.)

Spidering has been around for a long time, yet people act like screen scraping is new. It is really the same thing that has existed for years. If you are going to do it, though, Perl is certainly the way to go. It is fast, efficient, and robust.

pbhjpbhj 6085 days ago

Theft requires that the taking deprives the current owner of access to, or use of, whatever was taken. Copyright infringement appears to be what you're referring to.

Google IMO is more of a symbiont than a parasite.

Freebytes 6085 days ago

You are correct, and I agree. It is not theft, and Google is really a huge collection of mitochondria... helping the fledgling Internet become something more by combining it with an intellectual powerhouse. I could not have said it any better myself.

mahmud 6085 days ago

Perl has been the language of choice for web spidering since 1999, when it replaced REBOL for that purpose. Hint: the LWP module made such a dent in the industry that nobody was able to replace it until the last year or two, when other things started popping up.

martian 6085 days ago

This line seems problematic:

  use strict;

Most of the web is messy. Beautiful Soup and its ilk would seem like a better choice for parsing.

gloob 6085 days ago

Ahem.

http://www.perl.com/doc/manual/html/lib/strict.html

martian 6085 days ago

Ouch, should have RTFM. Thanks for the pointer.

geoscripting 6085 days ago

strict is a standard Perl module. HTML::TreeBuilder seems to work just as well with malformed HTML.
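
To make the distinction concrete, here is a small sketch, assuming HTML::TreeBuilder is installed from CPAN. `use strict` only tightens the rules for the Perl code itself, while the parser still accepts sloppy markup. The sample HTML and the helper name `extract_links` are invented for illustration.

```perl
use strict;      # checks the Perl code (undeclared variables etc.), not the HTML
use warnings;
use HTML::TreeBuilder;

# Deliberately malformed markup: unclosed <p> and <li> tags,
# and no closing </body> or </html>.
my $html = '<html><body><p>First<p>Second'
         . '<ul><li><a href="/a">A</a><li><a href="/b">B</a></ul>';

# Hypothetical helper: return the href of every <a> element, in document order.
sub extract_links {
    my ($markup) = @_;
    my $tree  = HTML::TreeBuilder->new_from_content($markup);
    my @hrefs = map { $_->attr('href') } $tree->look_down( _tag => 'a' );
    $tree->delete;    # release the parse tree explicitly
    return @hrefs;
}

my @links = extract_links($html);
print "$_\n" for @links;
```

The parser fills in the missing structure on its own, so both links are found even though the markup would never validate.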