by raesene3 4446 days ago
Interesting to see this hit big companies like Google. The problem, I think, stems from the fact that most people treat XML parsers as a "black box" and don't enquire too closely into all the functionality that they support.

Reading the spec that led to the implementations can often reveal interesting things, like support for external entities.

3 comments

I would say the flaw is that XML parsers will try to resolve external entities on their own, by resolving file paths or whatever. They shouldn't do this by default: they should instead take a programmer-supplied entity resolver and call into that.

They could also provide a canned resolver which hits the local filesystem and/or the web, which programmers could supply if they wanted, but this should not be a default. The programmer should have to explicitly specify that access.

I've had related problems where XML parsers would try to go off and fetch DTDs from the web, then fail, because they were running on firewalled machines that couldn't see the servers hosting the DTDs. That took us by surprise. We installed an entity resolver that looked in a local cache of DTDs instead, which was fairly easy. But I would prefer not to have been surprised.
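For anyone wondering what such a resolver looks like, here is a minimal sketch using Python's standard `xml.sax`. It is not the setup described above: the cache contents, the `example.invalid` URL and the names `LOCAL_COPIES`, `LocalCacheResolver` and `parse_with_cache` are made up for illustration, and it resolves an external general entity rather than a DTD, since that is the case the stdlib parser exercises most directly.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, EntityResolver, feature_external_ges
from xml.sax.xmlreader import InputSource

# Hypothetical cache: system identifier -> bytes of a locally stored copy.
LOCAL_COPIES = {
    "http://example.invalid/shared/notice.xml":
        b"<notice>served from local cache</notice>",
}


class LocalCacheResolver(EntityResolver):
    """Answers entity lookups from an in-memory cache, never from the network."""

    def __init__(self, copies):
        self.copies = copies

    def resolveEntity(self, publicId, systemId):
        if systemId not in self.copies:
            # Fail loudly instead of letting the parser open the URL itself.
            raise LookupError("no local copy of %r" % systemId)
        source = InputSource(systemId)
        source.setByteStream(io.BytesIO(self.copies[systemId]))
        return source


class TextCollector(ContentHandler):
    """Gathers all character data the parser reports."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)


def parse_with_cache(document, copies):
    handler = TextCollector()
    parser = xml.sax.make_parser()
    # External general entities must be switched on for the resolver to be asked.
    parser.setFeature(feature_external_ges, True)
    parser.setEntityResolver(LocalCacheResolver(copies))
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(document))
    return "".join(handler.chunks)


DOC = (b'<!DOCTYPE doc [<!ENTITY notice SYSTEM '
       b'"http://example.invalid/shared/notice.xml">]>'
       b'<doc>&notice;</doc>')
text = parse_with_cache(DOC, LOCAL_COPIES)
```

The document names a remote URL, but the parser only ever sees the bytes the resolver hands back.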

Also, all this stuff should be running in a jail where it can't even see any interesting files, of course.

> They shouldn't do this by default: they should instead take a programmer-supplied entity resolver and call into that.

Then programmers would most probably write their own resolvers, with even more bugs. You would have 10,000 broken implementations of that code, half of them copied from a Stack Overflow example with security left as an exercise for the reader.

You could have a default implementation that callers have to set, eg:

    xmlSetFileResolver (xml, xmlDefaultFileResolver);

Callers could provide their own, but most will use none or use the supplied default.

Of course nothing helps for people who code by copying and pasting, rather than understanding what the API or library does.
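A rough Python analogue of that opt-in shape, as a sketch only: `new_parser`, `element_names` and `demo` are hypothetical helpers, not part of any library. The stock `xml.sax.handler.EntityResolver` plays the role of the canned default here, since it simply hands the system identifier back and lets the parser open it, and a temporary file stands in for whatever the document is trying to reach.

```python
import io
import os
import tempfile
import xml.sax
from xml.sax.handler import ContentHandler, EntityResolver, feature_external_ges


class ElementNames(ContentHandler):
    """Records the name of every element the parser reports."""

    def __init__(self):
        super().__init__()
        self.names = []

    def startElement(self, name, attrs):
        self.names.append(name)


def new_parser(resolver=None):
    """Hypothetical factory: external entities stay off unless a resolver is passed."""
    parser = xml.sax.make_parser()
    parser.setFeature(feature_external_ges, resolver is not None)
    if resolver is not None:
        parser.setEntityResolver(resolver)
    return parser


def element_names(document, resolver=None):
    handler = ElementNames()
    parser = new_parser(resolver)
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(document))
    return handler.names


def demo():
    # A temporary local file stands in for the external resource.
    with tempfile.NamedTemporaryFile("wb", suffix=".xml", delete=False) as f:
        f.write(b"<secret/>")
    try:
        doc = ('<!DOCTYPE doc [<!ENTITY ext SYSTEM "%s">]>'
               '<doc>&ext;</doc>' % f.name).encode()
        closed = element_names(doc)                    # no resolver supplied
        opened = element_names(doc, EntityResolver())  # explicit opt-in
    finally:
        os.unlink(f.name)
    return closed, opened
```

Calling `demo()` parses the same document twice; only the call that was explicitly given a resolver pulls in the `secret` element from the file.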

Also, the defaults in XML parsers are horrible. That any XML parser allows retrieval of DTDs without explicit options specifying allowed sources etc. is beyond me. It's not just local file access, which becomes a security hole when you let users pass you XML files, though that is one of the worst ones.

But the number of times I've seen production apps that turn out to regularly request DTDs or schemas from remote servers behind the scenes has made that one of the first things I check if I am tasked to maintain or look into anything that parses XML. Often these apps stop working or slow down for seemingly no reason because the DTD or schema becomes unavailable, and nobody understands why.
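One way to run that kind of check, sketched with Python's `xml.parsers.expat`: list every external reference a document declares without fetching any of them. The function name `external_references` and the sample document are illustrative; the handler attributes are the standard pyexpat ones. Note this only inspects the document itself, not what a given application's parser is configured to do with it.

```python
import xml.parsers.expat


def external_references(document):
    """Return (kind, system_id) pairs for each external reference declared."""
    found = []

    def on_doctype(name, system_id, public_id, has_internal_subset):
        # A system identifier on the DOCTYPE points at an external DTD.
        if system_id is not None:
            found.append(("doctype", system_id))

    def on_entity(name, is_parameter, value, base, system_id, public_id, notation):
        # Internal entities carry a value; external ones carry a system identifier.
        if system_id is not None:
            found.append(("entity", system_id))

    parser = xml.parsers.expat.ParserCreate()
    parser.StartDoctypeDeclHandler = on_doctype
    parser.EntityDeclHandler = on_entity
    # No external entity handler is installed, so nothing is fetched here.
    parser.Parse(document, True)
    return found


SAMPLE = '''<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd" [
  <!ENTITY local "plain text">
  <!ENTITY passwd SYSTEM "file:///etc/passwd">
]>
<html/>'''
refs = external_references(SAMPLE)
```

Anything that shows up in `refs` is something a permissively configured parser might try to open or download.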

The crazy part about this is that I remember having these conversations over a decade ago and it was very clearly recognized as a major security, reliability and performance problem but the greater XML community basically just shrugged it off.

One really interesting aspect of this is that many applications suddenly broke when the Republicans shut down the government last year because a number of XML schemas are managed by government agencies who were suddenly legally unable to provide their normal web services:

http://gis.stackexchange.com/a/73777
http://forums.arcgis.com/threads/94294-Expected-DTD-markup-w...
http://www.catalogingrules.com/?p=77

Makes me wonder whether it's time to start contributing patches to disable bad ideas like this by default — some places are clearly paying a significant amount to serve content nobody should need: http://www.w3.org/blog/systeam/2008/02/08/w3c_s_excessive_dt...

It's bad practice to fetch an external DTD on a server you don't control, first for security reasons, second because your application then depends on something that can go away anytime, third because it's rude to the third party.

twic is right that one should always use entity resolvers that point to local resources and that parsers should run in a sandbox without external access.

He's also right to say that by default parsers shouldn't go fetch external resources. I think the reason they do is historical: entity resolvers appeared later than the parsers themselves.

It is bad practice, but do you know how uncannily common it is?

Just remember that the W3C had to impose download restrictions on the (X)HTML DTDs (http://www.w3.org/Help/Webmaster#block)

I agree, this can be summarised as "abstraction hides bugs". I believe that although abstraction is a powerful tool, there is such a thing as too much of it, and when reading an XML document can cause access to other files, maybe even across the network, perhaps things have gone a little too far. This isn't like an obvious #include or @import, it's much more subtle.

When I first noticed that HTML doctypes have URLs in them, I inquisitively tried accessing them, and it brought up a lot of questions in my mind about why it was designed that way, what would happen if the URLs no longer existed, etc. Such an explicit external dependency just didn't feel right to me. Unfortunately most people either don't notice or seem to ignore these things...

Interestingly enough, not all XML parsers support external entities; the first one to come to mind is this: http://tibleiz.net/asm-xml/introduction.html

They are supposed to be identifiers, not something to be resolved. But using an http URL for something that is not meant to be resolved is odd...