by wcarss 4682 days ago

I have two major gripes with lxml that this library solves, but I agree that for serious projects, lxml is the correct choice.

1) you have to build libxml to use lxml

2) lxml has a large, powerful, complicated API

For 1), a friend of mine had to resort to an annoying workaround to use lxml on his box, because its limited memory prevented him from building libxml. Because xmltodict is based on Expat[1], which ships with Python's standard library, you don't have to build libxml in your environment to use it.
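To make the Expat point concrete, here is a minimal sketch of turning XML into nested dicts with nothing but the standard library's `xml.parsers.expat` binding. It is not xmltodict's actual implementation: the function name `parse_to_dict` is hypothetical, attributes are ignored, and mixed content is not handled. It only illustrates why no libxml build is needed.

```python
import xml.parsers.expat


def parse_to_dict(xml_string):
    """Turn an XML string into nested dicts using only the stdlib Expat binding.

    Hypothetical helper for illustration; attributes are ignored.
    """
    root = {}
    # Each stack entry is a (children dict, text parts) pair for an open element.
    stack = [(root, [])]

    def start(name, attrs):
        # Open a new element with an empty child dict and text buffer.
        stack.append(({}, []))

    def end(name):
        children, text_parts = stack.pop()
        text = ''.join(text_parts).strip()
        # Leaf elements collapse to their text; others keep their child dict.
        value = children if children else (text or None)
        parent = stack[-1][0]
        if name in parent:
            # Repeated tags are gathered into a list.
            if not isinstance(parent[name], list):
                parent[name] = [parent[name]]
            parent[name].append(value)
        else:
            parent[name] = value

    def chars(data):
        # Expat may deliver text in several chunks, so buffer them.
        stack[-1][1].append(data)

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = chars
    parser.Parse(xml_string, True)
    return root
```

With a made-up feed such as `<rss><channel><title>Example</title></channel></rss>`, the result can be walked like any other dict: `parse_to_dict(feed)['rss']['channel']['title']`.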

For 2), when I went to write a simple rss-reader project this past weekend, I dreaded going back to lxml. I knew that I'd have to go peruse its huge documentation to answer questions about whether to use etree.XML or etree.fromstring, whether the methods I wanted were on Elements or ElementTrees, xpath syntax, custom parsers, etc. If I'd ever seen the objectify API I'd forgotten about it, because there's just so much _other_ stuff in lxml.

I happened to find xmltodict in a brief search for lxml alternatives. It's on PyPI, so pip grabbed it with no complaints. It installed without building anything. And in less than a minute of glancing at the README, I grokked the API as "pydict = xmltodict.parse(xml_string)". I don't know if there are other things. I never had to find out.

Less than 10 minutes from finding it to forgetting I was reading XML as a source -- really a wonderful project. But if I were doing something 'serious', I'd absolutely use lxml. That large API and those byzantine docs exist for a good reason: they're dealing with XML properly. But sometimes coders just wanna have fun, or build a quick prototype or hack.

[1] http://docs.python.org/2/library/pyexpat.html

1 comment

hafabnew 4682 days ago

Agree that lxml is certainly quite large and batteries-included, but I've always found I can do pretty much everything I want to just using a few select methods, namely etree.XML and xpath.

E.g. for extracting URLs from a sitemap:

    from lxml import etree

    root = etree.XML(data)

    urls = root.xpath(
        './/sitemap:loc/text()',
        namespaces={
            'sitemap': 'http://www.sitemaps.org/schemas/sitemap/0.9',
        }
    )

For RSS reading, here's a straightforward example. Obviously you can factor out the index access on each item, and the `/text()` suffix too.

    from urllib.request import urlopen

    from lxml import etree

    # Fetch the raw feed bytes; etree.XML accepts bytes directly.
    data = urlopen('https://news.ycombinator.com/rss').read()

    root = etree.XML(data)

    for i, item in enumerate(root.xpath('.//item')):
        # xpath() returns a list, so take the first text node of each field.
        print(i, item.xpath('title/text()')[0])
        print(item.xpath('description/text()')[0])
        print(item.xpath('link/text()')[0])
        print()

All pretty simple, though xpath selectors can get a bit gnarly at times. The tradeoff is that you can be very expressive with them.
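The factoring mentioned for the RSS example (dropping the repeated `[0]` and `/text()`) could look roughly like the sketch below. It swaps in the standard library's `xml.etree.ElementTree` so it runs without lxml installed. lxml's elements follow the same ElementTree API and also provide `findtext` and `iter`, so the helper should carry over. The names `item_fields`, `read_items` and the `SAMPLE_RSS` feed are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed, used only so the sketch runs offline.
SAMPLE_RSS = """<rss version="2.0"><channel>
<item><title>First</title><link>http://example.com/1</link><description>One</description></item>
<item><title>Second</title><link>http://example.com/2</link></item>
</channel></rss>"""


def item_fields(item, names=('title', 'description', 'link')):
    """Return {name: text} for the named children of one <item>."""
    # findtext returns the first matching child's text, or the default
    # when the child is missing, so no [0] indexing is needed.
    return {name: item.findtext(name, default='') for name in names}


def read_items(xml_data):
    """Parse a feed and return one dict of fields per <item>."""
    root = ET.fromstring(xml_data)
    return [item_fields(item) for item in root.iter('item')]


if __name__ == '__main__':
    for i, fields in enumerate(read_items(SAMPLE_RSS)):
        print(i, fields['title'])
        print(fields['description'])
        print(fields['link'])
        print()
```

One side effect of this shape is that a missing `<description>` yields an empty string instead of an `IndexError`, which the indexed xpath version would raise.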