| HN Mirror

	by hafabnew 4682 days ago
	Cool, although LXML (mature, fast, reliable, generally awesome) can do this (and more): http://lxml.de/FAQ.html#how-can-i-map-an-xml-tree-into-a-dic... It also has the 'objectify' API, where you can access XML nodes via regular object access (i.e. `access.nodes[0].like.this`). http://lxml.de/objectify.html
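
The tree-to-dict mapping mentioned above can be sketched in a few lines. This is a minimal illustration, not the FAQ's exact recipe: it uses the standard library's `xml.etree.ElementTree` so it runs without lxml (an assumption made for portability), and `element_to_dict` and the sample document are hypothetical names made up for the example.

```python
import xml.etree.ElementTree as ET


def element_to_dict(element):
    """Return (tag, value): a dict of children, or the text for a leaf element."""
    # Note: repeated sibling tags overwrite each other in this naive version,
    # and attributes are ignored.
    children = dict(element_to_dict(child) for child in element)
    return element.tag, children or element.text


root = ET.fromstring(
    "<feed><title>HN</title><item><link>http://example.com/</link></item></feed>"
)
tag, tree = element_to_dict(root)
print(tag, tree)
```

The same function should work on elements parsed with `lxml.etree`, since it only relies on iteration, `.tag` and `.text`.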

8 comments

wcarss 4682 days ago

I have two major gripes with lxml that this library solves, but I agree that for serious projects, lxml is the correct choice.

1) you have to build libxml to use lxml

2) lxml has a large, powerful, complicated API

For 1), a friend of mine had to use an annoying workaround to get lxml onto his box, because it had too little memory to build libxml. Because xmltodict is Expat[1]-based, you don't have to build libxml in your environment to use it.

For 2), when I went to write a simple RSS reader project this past weekend, I dreaded going back to lxml. I knew that I'd have to peruse its huge documentation to answer questions about whether to use lxml.etree.XML or lxml.etree.fromstring, whether the methods I wanted were on Elements or ElementTrees, XPath syntax, custom parsers, etc. If I'd ever seen the objectify API I'd forgotten about it, because there's just so much _other_ stuff in lxml.

I happened to find xmltodict in a brief search for lxml alternatives. It's on PyPI, so pip grabbed it with no complaints. It installed without building anything. And in less than a minute of glancing at the README, I grokked the API as "pydict = xmltodict.parse(xml_string)". I don't know whether the API has anything else. I never had to find out.

Less than 10 minutes from finding it to forgetting I was reading XML as a source -- really a wonderful project. But if I were doing something 'serious', I'd absolutely use lxml. That large API and those byzantine docs exist for a good reason: they're dealing with XML properly. But sometimes coders just wanna have fun, or build a quick prototype or hack.

[1] http://docs.python.org/2/library/pyexpat.html

hafabnew 4682 days ago

Agree that lxml is certainly quite large and batteries-included, but I've always found I can do pretty much everything I want to just using a few select methods, namely etree and xpath.

E.g. for extracting URLs from a sitemap:

    from lxml import etree

    root = etree.XML(data)

    urls = root.xpath(
        './/sitemap:loc/text()',
        namespaces={
            'sitemap': 'http://www.sitemaps.org/schemas/sitemap/0.9',
        }
    )

For RSS reading, here's a straightforward example. Obviously you can factor out the index access on each item, and calling `/text()` too.

    from urllib.request import urlopen

    from lxml import etree

    # Fetch the raw feed; etree.XML accepts the bytes directly.
    data = urlopen('https://news.ycombinator.com/rss').read()

    root = etree.XML(data)

    for i, item in enumerate(root.xpath('.//item')):
        # Each xpath() call returns a list, hence the [0].
        print(i, item.xpath('title/text()')[0])
        print(item.xpath('description/text()')[0])
        print(item.xpath('link/text()')[0])
        print()

All pretty simple, though XPath selectors can get a bit gnarly at times. The tradeoff is that you can be very expressive with them.

raverbashing 4682 days ago

Good, I had never seen this transformation tip in LXML.

But usually, yes, LXML is "good", meaning it's the least bad way of dealing with XML.

Also, it has some idiosyncrasies, like insisting on adding the namespace to tag names, so you end up with something like {http://example.com/your.xlsd}.index (I don't remember it exactly and I don't have an example here with me).

toyg 4682 days ago

Correctly handling namespaced QNames is a requirement, not a bug. It makes things awkward at times, but it's the lib writers' job to provide decent interfaces.

I haven't used LXML in a while, but ElementTree, for example, forces you to use the QName in XPath expressions, which is technically correct but a huge pain. It would be nice if there were a ScrewNamespace option that allowed "simple" searches, although this might blow up in your face one day (when two namespaces define the same element name and your XPath search brings up elements you didn't really want).
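
To make the namespace point concrete, here is a small sketch using the standard library's `xml.etree.ElementTree` and the sitemap namespace from earlier in the thread. The sample document and the `sm` prefix are made up for illustration. The `{*}` wildcard is the closest thing to a "ScrewNamespace" switch I'm aware of in the stdlib; it needs Python 3.8 or later and carries exactly the name-collision risk described above.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

# Hypothetical two-entry sitemap, just for the demonstration.
doc = ET.fromstring(
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>http://example.com/a</loc></url>"
    "<url><loc>http://example.com/b</loc></url>"
    "</urlset>"
)

# Strict form: bind a prefix of our choosing to the namespace URI.
strict = [el.text for el in doc.findall(".//sm:loc", {"sm": SITEMAP_NS})]

# Loose form (Python 3.8+): {*} matches the tag in any namespace, or none.
loose = [el.text for el in doc.findall(".//{*}loc")]

# A bare name finds nothing, because the elements live in a namespace.
bare = doc.findall(".//loc")

print(strict, loose, bare)
```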

raverbashing 4682 days ago

Actually, the short name worked for the searches; the problem was reading element names from a subtree.

It's not so much dodging the requirements as an inconsistency in the API.

TimSAstro 4682 days ago

I also found the namespacing to be a bit weird, and it took quite a while to grok the documentation. In case anyone wants a working example, I implemented a wrapper to drop the namespacing (resulting in simple objectify attribute access) for one particular XML schema here: https://github.com/timstaley/voevent-parse/blob/master/voepa...

georgebashi 4682 days ago

That's the "node-name", i.e. the fully qualified name of the tag in its namespace. You probably wanted to ask for the local-name, which in your example would just be "index". I'm not sure how to do that with LXML, but it's a common mistake people make when dealing with XML.
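
A rough sketch of the qualified-name versus local-name distinction, using the standard library's `xml.etree.ElementTree` rather than LXML so that it runs anywhere. The `local_name` helper and the sample document are hypothetical; the namespace URI is borrowed from the example earlier in the thread. As far as I know, lxml also offers `etree.QName(element).localname` for the same purpose, but that is not exercised here.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a leading '{namespace-uri}' from a tag, if there is one."""
    return tag.rpartition("}")[2]


root = ET.fromstring(
    '<doc xmlns="http://example.com/your.xlsd"><index>1</index></doc>'
)
child = root[0]

qualified = child.tag            # tag as stored: '{namespace-uri}index'
short = local_name(child.tag)    # just the local part

print(qualified)
print(short)
```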

martinblech 4682 days ago

There is, however, a small feature in xmltodict that most people overlook: the streaming mode. I actually wrote xmltodict the day I tried to parse a Wikipedia dump: I couldn't keep it all in memory but needed something more high-level than SAX.
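
The memory concern can be illustrated with the standard library alone. To be clear, this is not xmltodict's streaming mode; it is the same idea sketched with `xml.etree.ElementTree.iterparse`, and the `page`/`title` element names, the `iter_titles` helper and the toy input are hypothetical stand-ins for a real dump.

```python
import io
import xml.etree.ElementTree as ET


def iter_titles(source):
    """Yield the title of each <page> as soon as its end tag is parsed."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "page":
            yield elem.findtext("title")
            # Release this page's children and text; the emptied element
            # itself stays attached to the root.
            elem.clear()


# Toy stand-in for a large dump file.
dump = io.BytesIO(
    b"<mediawiki>"
    b"<page><title>A</title></page>"
    b"<page><title>B</title></page>"
    b"</mediawiki>"
)
titles = list(iter_titles(dump))
print(titles)
```

Each item is handled and discarded as it arrives, so the whole document never has to sit in memory as one tree.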

martinblech 4682 days ago

xmltodict is in no way trying to compete with LXML feature-wise (no support for namespaces yet, just to name one thing). It's just a lightweight approach to round-tripping between XML and JSON documents that worked for my use case, so I decided to share it.

drunkpotato 4682 days ago

Yes, my first thought when reading the headline was that it would be about LXML.

However, while LXML can do this, and makes it easy, the documentation does not stress this way of using LXML. I like this project's emphasis on simplicity and doing one thing. It's the difference between "You can use LXML to get a dict" vs "Here is how to use xmltodict to get a dict". And it's right there in the name. Emphasis and naming are important when getting started.

aidos 4682 days ago

LXML is a little unapproachable when you first use it, but it's all of the other great things you mentioned too. Now that I've used it a lot, I would never consider trading it for something simpler. It can handle any situation you're going to run into. I'd suggest that people looking to do anything more than a quick hack invest an afternoon in learning the LXML API.

JimmaDaRustla 4682 days ago

I use it as the parser for BeautifulSoup! It should be the default IMHO; the default parser caused me lots of issues.

chernevik 4682 days ago

The objectify API looks a lot better than the __getattr__ hacks I've been using for this. Thanks.
