

by JackC 5171 days ago

For really quick one-off scraping, httplib2+lxml+PyQuery is a pretty neat combination:

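  import httplib2, lxml.etree, pyquery  # a bare "import lxml" does not guarantee lxml.etree is loaded

  h = httplib2.Http(".cache")  # responses are cached on disk in ./.cache

  def get(url):
      # Treat a cached copy younger than an hour as fresh.
      resp, content = h.request(url, headers={'cache-control': 'max-age=3600'})
      return pyquery.PyQuery(lxml.etree.HTML(content))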

This gives you a little function that fetches any URL as a jQuery-like object:

  pq = get("http://foo.com/bar")
  checkboxes = pq('form input[type=checkbox]')
  nextpage = pq('a.next').attr('href')

And of course all of the requests are cached using whatever cache headers you want, so repeated requests for the same URL are served from the local cache instead of the network as you iterate.
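
A rough sketch of how the `nextpage` lookup could drive a small crawl loop. The `crawl` helper and its `max_pages` parameter are hypothetical names made up for illustration; `get` is the function defined above, passed in as an argument. It assumes `.attr('href')` returns `None` when the selector matches nothing, and it resolves relative links with the standard library's `urljoin`.

```python
from urllib.parse import urljoin


def crawl(start_url, get, max_pages=10):
    """Yield (url, page) pairs, following 'a.next' links.

    `get` is any callable that maps a URL to a PyQuery-like page object.
    The loop stops on a missing link, a repeated URL, or after max_pages.
    """
    url = start_url
    seen = set()
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        page = get(url)
        yield url, page
        href = page('a.next').attr('href')  # None when there is no next link
        # The href may be relative, so resolve it against the current URL.
        url = urljoin(url, href) if href else None
```

Because every fetch goes through the cached `get`, rerunning the loop while tweaking selectors should not hit the remote server again until the cached copies expire.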

Just something else to throw in the toolbelt ...

1 comment

the_cat_kittles 5171 days ago

Have you checked out Kenneth Reitz's requests? It's fantastic; you might like it.


codehenge 5171 days ago

Link, for the interested:

https://github.com/kennethreitz/requests


the_cat_kittles 5171 days ago

Thanks, I should have included a code sample too:

    import requests
    from lxml import etree

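    # 'url' is a placeholder. The result is a plain lxml element (XPath-ready), not yet a PyQuery object.
    jquery_like_page = etree.HTML(requests.get('url').text)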
