by samarudge 4847 days ago
PyQuery is pretty awesome (https://pypi.python.org/pypi/pyquery)

Use Requests to download the document, pump it into PyQuery, and you can use any jQuery-style selector to get text, attributes and all sorts of other stuff.

Example: here's how to scrape the Hacker News homepage: https://gist.github.com/samarudge/035ab8aaca224415cb49 (that code could probably be improved, but I only spent a couple of minutes on it)

4 comments

Watch out for Unicode when using pyquery and requests. I recently provided a fix for that, which is now merged into the pyquery repo. I use it (among other things) to scrape upcoming comic book releases =) http://cuppster.com/2013/01/30/decorators-scrapers-and-gener...
I also prefer PyQuery over Beautiful Soup.

Especially since you can use a Chrome or Firefox plugin to inject jQuery into any webpage and then refine your selector via the JavaScript console.

All you need to do then is copy the selector into your Python script and you are done.

I definitely recommend this for people used to the jQuery syntax. Requests + PyQuery took no time at all to learn and did everything I needed for some basic page crawling.
In my experience, PyQuery always seems to be faster than BS4 (for ripping the same information). Anyone else have a similar experience?
Only on well-formed pages. There are many, many, many malformed pages on the internet, even ones created in 2013.
Fortunately, HTML5 defined a standard way to parse even broken HTML, and that parser is implemented in the html5lib package. You can also use it with lxml, and even use "jQuery-like" selectors with lxml.cssselect (http://lxml.de/cssselect.html)