by twoodfin 1486 days ago
This is just so basic a screwup though. The W3C spec for XML has had a formal syntactic description of valid tag names for decades:

https://www.w3.org/TR/2006/REC-xml11-20060816/#sec-common-sy...

Plenty of libraries get this right because it’s so easy. You’d almost have to try—probably by being “clever”—to get it wrong.
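To illustrate how mechanical this is, the Name production from that section can be transcribed almost directly into a regular expression. This is a rough sketch in Python, assuming the NameStartChar and NameChar ranges from the XML 1.1 Recommendation; `is_valid_name` is a hypothetical helper, not part of any library:

```python
import re

# NameStartChar ranges, transcribed from the XML 1.1 grammar.
NAME_START = (
    r":A-Z_a-z"
    r"\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF"
    r"\u0370-\u037D\u037F-\u1FFF\u200C-\u200D"
    r"\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF"
    r"\uF900-\uFDCF\uFDF0-\uFFFD"
    r"\U00010000-\U000EFFFF"
)

# NameChar adds "-", ".", digits, the middle dot and two more ranges.
NAME_CHAR = NAME_START + r"\-.0-9\u00B7\u0300-\u036F\u203F-\u2040"

NAME_RE = re.compile(f"[{NAME_START}][{NAME_CHAR}]*")


def is_valid_name(candidate: str) -> bool:
    """Return True if the whole string matches the Name production."""
    return NAME_RE.fullmatch(candidate) is not None


if __name__ == "__main__":
    for name in ["svg", "xlink:href", "1abc", "a>b", ""]:
        print(repr(name), is_valid_name(name))
```

Note that this only checks the characters of an already-decoded string; it says nothing about whether the bytes it came from were valid in the first place.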


While I'm not defending the screw-up here (it's bad), it does it a slight injustice to omit that the issue was not something simplistic around ASCII/UTF-8 parsing but rather a failure to reject or escape malformed UTF-8 strings. Unicode handling, even in actual programming language implementations, is an extremely common and well-documented source of problems.
I think it's worth remembering that XML parsing is also a big historic source of bugs, which suggests to me that while it may look simple and well-formed on the surface, it's probably a lot harder than it looks.
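As a small illustration of what "rejecting malformed UTF-8" means in practice, here is a sketch in Python. The helper name `decode_or_reject` and the sample byte strings are made up for this example and are not taken from the bug being discussed:

```python
from typing import Optional


def decode_or_reject(raw: bytes) -> Optional[str]:
    """Decode strictly as UTF-8; return None instead of guessing on bad input."""
    try:
        return raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return None


if __name__ == "__main__":
    samples = {
        "valid": "caf\u00e9".encode("utf-8"),
        "truncated sequence": b"caf\xc3",      # lead byte with no continuation
        "overlong '>'": b"<a\xc0\xbe",         # 0xC0 0xBE is an overlong form of U+003E
        "stray continuation": b"a\x80b",
    }
    for label, raw in samples.items():
        print(label, "->", decode_or_reject(raw))
```

The overlong case is the interesting one: a decoder that accepts it hands downstream code a `>` that an earlier byte-level check never saw.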
Could you give examples? There have been plenty of problems with certain standards layered atop XML, and with self-made implementations of XML parsers and unparsers [1], but there is also a well-tested set of standards-compliant XML libraries that avoid those issues.

[1]: An internationally known consulting firm, which I won't name, had (perhaps still has) an internal tool that compiles an Excel description of a service interface into actual XML parsing code that accepts only one hard-coded namespace alias for each given namespace. Over the years I've come across multiple companies with that bug in some service. Every time I looked into it, the cause was the same internal tool from that consulting firm. I've also repeatedly met people who had already discovered the same thing.
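To make the alias bug in [1] concrete: the prefix is only a local shorthand, and a namespace-aware parser compares the namespace URI instead. Here is a rough sketch using Python's standard `xml.etree.ElementTree`; the namespace URI, element names and `order_id` helper are all hypothetical:

```python
import xml.etree.ElementTree as ET

NS = "http://example.com/orders"  # hypothetical namespace URI

# Three spellings of the same document; only the prefix differs.
DOC_A = f'<o:order xmlns:o="{NS}"><o:id>7</o:id></o:order>'
DOC_B = f'<ord:order xmlns:ord="{NS}"><ord:id>7</ord:id></ord:order>'
DOC_C = f'<order xmlns="{NS}"><id>7</id></order>'


def order_id(xml_text: str) -> str:
    """Look elements up by namespace URI, never by the prefix the sender chose."""
    root = ET.fromstring(xml_text)
    # ElementTree exposes names in "{uri}local" form, so prefixes never appear.
    if root.tag != f"{{{NS}}}order":
        raise ValueError(f"unexpected root element: {root.tag}")
    value = root.findtext(f"{{{NS}}}id")
    if value is None:
        raise ValueError("missing id element")
    return value


if __name__ == "__main__":
    for doc in (DOC_A, DOC_B, DOC_C):
        print(order_id(doc))
```

Code that instead matches the literal string `o:order` would accept only the first spelling, which is the behaviour described above.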

I have the same question as the sibling commenter: are you sure you mean parsing (i.e. well-formedness) and not handling (i.e. the logic that does things with the parsed data, e.g. XXE, namespace separation, etc.)?

Obviously all software has some bugs, and I'm sure XML parsers are no exception, but I haven't personally been aware of any high-profile ones before this.

For a quick example of a lowish-level XML bug that isn't parsing-related, I reported a bug many years ago in a piece of software whereby attributes without CURIE prefixes were being placed into the wrong namespace. A weird quirk of the XML spec is that unprefixed tags go into the default namespace but unprefixed attributes go into a "NULL" namespace (or, if I recall correctly, sometimes a specific namespace depending on the tag?). That's not a parser bug though, since the parser has parsed the tag, attributes and associated prefix strings (or lack thereof) correctly: it just does something wrong post-parsing.
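The quirk is easy to see with a namespace-aware library. Here is a small sketch using Python's standard `xml.etree.ElementTree`, with made-up namespace URIs and attribute names:

```python
import xml.etree.ElementTree as ET

# Hypothetical document: one default namespace, one prefixed namespace.
DOC = (
    '<item xmlns="http://example.com/default" '
    'xmlns:x="http://example.com/x" '
    'size="1" x:kind="a"/>'
)

root = ET.fromstring(DOC)

# The unprefixed element picks up the default namespace.
element_name = root.tag

# The unprefixed attribute does not: it stays in no namespace at all,
# while the prefixed attribute carries its namespace URI.
attribute_names = sorted(root.attrib)

if __name__ == "__main__":
    print(element_name)
    print(attribute_names)
```

Here `size` comes back as a bare name while `item` comes back qualified with the default namespace, so code that assumes attributes inherit the element's namespace will look them up under the wrong key.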

I feel like that class of bug is very common with XML, but it's more of an application stability concern than a security one (XXE being a notable exception just because it deals with I/O).

IMO the best response to this kind of analysis is to humbly realize that any of us, working under real-world pressures, could make such a screw-up, and contemplate how we'll remain vigilant and mitigate the damage that comes from our inevitable screw-ups.