by hafabnew 4680 days ago

Agree that lxml is certainly quite large and batteries-included, but I've always found I can do pretty much everything I want using just a couple of its pieces, namely the etree module and xpath().

E.g. for extracting URLs from a sitemap:

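    from lxml import etree

    # data holds the raw sitemap XML as bytes (fetched elsewhere)
    root = etree.XML(data)

    # The prefix 'sitemap' is arbitrary; only the namespace URI has to match
    urls = root.xpath(
        './/sitemap:loc/text()',
        namespaces={
            'sitemap': 'http://www.sitemaps.org/schemas/sitemap/0.9',
        }
    )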

For RSS reading, here's a straightforward example. Obviously you can factor the `[0]` index access and the `/text()` suffix out of each line into a helper.

    from urllib.request import urlopen

    from lxml import etree

    data = urlopen('https://news.ycombinator.com/rss').read()

    root = etree.XML(data)

    for i, item in enumerate(root.xpath('.//item')):
        # xpath() returns a list, so take the first text node of each field
        print(i, item.xpath('title/text()')[0])
        print(item.xpath('description/text()')[0])
        print(item.xpath('link/text()')[0])
        print()
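
A rough sketch of that factoring, in Python 3, in case it helps. The helper names `first_text` and `parse_items` and the inline `SAMPLE_RSS` feed are made up for illustration (they are not part of lxml), and returning `None` for a missing field is just one possible choice:

```python
from lxml import etree

# Hypothetical sample feed so the sketch runs without a network call
SAMPLE_RSS = b"""<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>First post</title>
      <description>About the first post</description>
      <link>http://example.com/1</link>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.com/2</link>
    </item>
  </channel>
</rss>"""


def first_text(node, path, default=None):
    """Return the first text node found at `path` under `node`, or `default`."""
    results = node.xpath(path + '/text()')
    return results[0] if results else default


def parse_items(data):
    """Turn raw RSS bytes into a list of dicts, one per <item>."""
    root = etree.XML(data)
    fields = ('title', 'description', 'link')
    return [
        {field: first_text(item, field) for field in fields}
        for item in root.xpath('.//item')
    ]


items = parse_items(SAMPLE_RSS)
for i, item in enumerate(items):
    print(i, item['title'], item['link'])
```

The upside of the helper is that an item with a missing element (like the second one above, which has no description) gives you `None` instead of an `IndexError` from the bare `[0]`.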

All pretty simple, though XPath selectors can get a bit gnarly at times. The tradeoff is that you can be very expressive with them.
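
To give a flavour of the expressive side, here's a small sketch using predicates, XPath functions and an XPath variable. The sample feed and the search word are invented for the example; the XPath features themselves (`contains()`, `string()`, `count()`, positional predicates, `$variables` passed as keyword arguments) are standard XPath 1.0 as supported by lxml:

```python
from lxml import etree

# Hypothetical feed, inlined so the example is self-contained
SAMPLE_RSS = b"""<rss version="2.0"><channel>
  <item><title>Parsing XML with lxml</title><link>http://example.com/lxml</link></item>
  <item><title>Unrelated news</title><link>http://example.com/news</link></item>
  <item><title>lxml vs. BeautifulSoup</title><link>http://example.com/vs</link></item>
</channel></rss>"""

root = etree.XML(SAMPLE_RSS)

# Predicate: links of only those items whose title contains "lxml"
lxml_links = root.xpath('.//item[contains(title, "lxml")]/link/text()')

# Same filter, but with the search word passed in as an XPath variable,
# which avoids building the expression with string formatting
by_word = root.xpath('.//item[contains(title, $word)]/link/text()', word='lxml')

# string() returns a plain string instead of a list of nodes
first_title = root.xpath('string(.//item[1]/title)')

# count() returns a float
n_items = root.xpath('count(.//item)')

print(lxml_links)
print(first_title, n_items)
```

Doing the filtering inside the selector saves a Python loop, at the cost of an expression that's harder to read at a glance, which is the gnarliness I mean.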