| HN Mirror

by samarudge 4847 days ago

PyQuery is pretty awesome (https://pypi.python.org/pypi/pyquery)

Use Requests to download the document, pump it into PyQuery, and you can use any jQuery-style selector to get text, attributes and all sorts of other stuff.

Example: here's how to scrape the Hacker News homepage: https://gist.github.com/samarudge/035ab8aaca224415cb49 (that code could probably be improved, but I only spent a couple of minutes on it)

4 comments

cuppster 4846 days ago

Watch out for Unicode when using pyquery and requests. I recently provided a fix for that, which is now merged into the pyquery repo. I use it (among other things) to scrape upcoming comic book releases =) http://cuppster.com/2013/01/30/decorators-scrapers-and-gener...
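The comment doesn't say what the fix was, so the following is not that fix. It is just one common way to sidestep encoding surprises: decode the response bytes yourself before handing text to the parser. A minimal, stdlib-only sketch; `decode_body` is a hypothetical helper, and the fallback order is an assumption.

```python
def decode_body(raw, declared=None):
    """Decode response bytes, trying the declared charset first, then UTF-8."""
    for encoding in (declared, "utf-8"):
        if not encoding:
            continue
        try:
            return raw.decode(encoding)
        except (LookupError, UnicodeDecodeError):
            # Unknown codec name or bytes that don't fit: try the next one.
            continue
    # Last resort: never raise, mark undecodable bytes with U+FFFD.
    return raw.decode("utf-8", errors="replace")


raw = "Café 漫画".encode("utf-8")
text = decode_body(raw)  # no charset declared, so UTF-8 is tried
print(text)
```

With Requests, `response.content` gives the raw bytes and `response.encoding` gives the charset taken from the headers, so something like `decode_body(response.content, response.encoding)` would slot in before the PyQuery call.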

Riesling 4846 days ago

I also prefer PyQuery over Beautiful Soup.

Especially since you can use a Chrome or Firefox plugin to inject jQuery into any webpage and then refine your selector in the JavaScript console.

All you need to do then is copy the selector into your Python script and you are done.

papa_bear 4847 days ago

I definitely recommend this for people used to the jQuery syntax. Requests + PyQuery took no time at all to learn and did everything I needed for some basic page crawling.

kimagure 4847 days ago

In my experience, PyQuery always seems to be faster than BS4 (for ripping the same information). Has anyone else had a similar experience?

chewxy 4847 days ago

Only on well-formed pages. There are many, many, many malformed pages on the internet, even ones created in 2013.

ville 4847 days ago

Fortunately, HTML5 defined a standard way to parse even broken HTML, and that parser is implemented in the html5lib package. You can also use it with lxml, and even use jQuery-like selectors with lxml.cssselect (http://lxml.de/cssselect.html)
