| HN Mirror


by twoodfin 1486 days ago

The XML parsing/validation bugs are, I suppose, not shocking, but deeply disappointing.

The one thing XML & its tooling were supposed to get right was document well-formedness. Sure, it might be a mess of a standard in other ways, but at least we could agree what a parser should and shouldn’t accept! (Not the case for the HTML tag soup of then or now.)

That, 25 years on, a popular XML processor can’t even meet that low bar for tag names is maddening.

2 comments

Diggsey 1486 days ago

There are just so many issues here.

1) Don't rely on two parsers having identical behaviour for security. Yes parsers for the same format should behave the same, but bugs happen, so don't design a system where small differences result in such a catastrophic bug. If you absolutely have to do this, at least use the same parser on both ends.

2) Don't allow layering violations. All content of XML documents is required to be valid in the configured character encoding. That means layer 1 of your decoder should be converting a byte stream into a character stream, and layers 2+ should not even have the opportunity to mess up decoding a character. Efficiency is not a justification, because you can use compile-time techniques to generate the exact same code as if you combined all layers into one. This has the added benefit that it removes edge-cases (if there is one place where bytes are decoded into characters, then you can't get a bug where that decoding is only broken in tag names, and so your test coverage is automatically better).

3) Don't transparently download and install stuff without user interaction, regardless of where it comes from!

4) Revoke certificates for old compromised versions of an installer so that downgrade attacks are not possible.
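A minimal Python sketch of the layering described in point 2, assuming UTF-8 input. `decode_layer`, `tag_names` and `parse` are made-up names for illustration, and the tag scanner is deliberately toy-sized rather than a real XML parser. The only point is that the second layer receives characters and never touches raw bytes, so a decoding bug cannot exist in tag names alone.

```python
import re

def decode_layer(raw: bytes) -> str:
    """Layer 1: the single place where bytes become characters.

    errors="strict" makes malformed UTF-8 raise UnicodeDecodeError
    instead of being skipped or replaced.
    """
    return raw.decode("utf-8", errors="strict")

def tag_names(text: str) -> list[str]:
    """Layer 2: toy scanner over characters (hypothetical, not real XML parsing)."""
    return re.findall(r"<([A-Za-z_][\w.\-]*)", text)

def parse(raw: bytes) -> list[str]:
    # Layer 2 only ever sees the output of layer 1.
    return tag_names(decode_layer(raw))
```

With this split, `parse(b"<a\xe2>")` fails in layer 1 before any tag logic runs, because `\xe2` starts a multi-byte sequence that is never completed.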

iancarroll 1486 days ago

> Revoke certificates for old compromised versions of an installer so that downgrade attacks are not possible.

Worth noting that Windows accepts signatures from revoked code signing certificates so long as the signature has a signed timestamp dated before the revocation.

hamandcheese 1486 days ago

…and I assume the revocation can’t be back-dated?

ComputerGuru 1486 days ago

Timestamps must come from a globally recognized signed source, like DigiCert or VeriSign.

iancarroll 1486 days ago

The CA could backdate the CRL’s revocation timestamp if they wanted, but it seems unlikely and presumably it’s not allowed.

CaliforniaKarl 1486 days ago

> 4) Revoke certificates for old compromised versions of an installer so that downgrade attacks are not possible.

I suggest the following alternative: When your own software is triggering the upgrade process, don't allow triggering an upgrade to an older version of the software.

In other words: If a user wants to downgrade, they will have to do the work of running the installer for the older version (and possibly uninstalling the newer version first).

This modified behavior addresses the problem mentioned in the article (a newer version of software running the installer for an older version), but still gives users the power to install an older version if they want.
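A rough sketch of that rule in Python, assuming plain dotted numeric version strings. `parse_version` and `may_auto_install` are hypothetical helpers, not anything from the product discussed in the article.

```python
def parse_version(v: str) -> tuple[int, ...]:
    """Turn a dotted numeric version such as '5.9.3' into (5, 9, 3)."""
    return tuple(int(part) for part in v.split("."))

def may_auto_install(installed: str, offered: str) -> bool:
    """Allow an automatic update only if the offered version is strictly newer."""
    return parse_version(offered) > parse_version(installed)
```

Comparing integer tuples rather than raw strings matters here: as strings, "5.10.0" sorts before "5.9.3". The check is also only as trustworthy as the source of the `offered` value.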

Bootvis 1486 days ago

Not entirely clear to me that this would be a sufficient mitigation in this case: the endpoint could claim Zoom version 999 is being served and serve the old exe and cab, which would then be run, possibly before other checks can even be done.

joefkelley 1486 days ago

> 3) Don't transparently download and install stuff without user interaction, regardless of where it comes from!

This is an interesting one. I totally get your point. But also users are terrible about updating their software if you give them the choice. Automatic updates have very practical security benefits. I've witnessed non-technical folks hit that "remind me later" button for years.

RhodesianHunter 1486 days ago

> I've witnessed non-technical folks hit that "remind me later" button for years.

Doesn't that then become their problem and responsibility?

account42 1485 days ago

> I've witnessed non-technical folks hit that "remind me later" button for years.

Maybe take the hint and add a "no" button instead of this manipulative "remind me later" shit.

bombcar 1486 days ago

I doubt anyone actively revokes certificates ever - perhaps maybe the game console makers.

crismigo 1486 days ago

dsdas

jerf 1486 days ago

Unfortunately, the problem here is programmers more so than formats. It literally doesn't matter what you specify, programmers will not implement it to a T. Most programmers simply don't know that every single detail matters. Many of those who may have some idea don't really care, since they can't imagine how something like this could happen.

It's not just XML. It's every ecosystem I've ever used. Push it around the edges and you will find things.

This is neat, not because it is special to JSON in particular but because it's an example of examining a good chunk of a large ecosystem: https://seriot.ch/projects/parsing_json.html Consider that this is likely to be true in any ecosystem that doesn't make avoiding it a top priority.

IshKebab 1486 days ago

I disagree. The way the format is designed has a direct effect on how likely implementors are to implement it correctly. So the format designers bear some responsibility.

For example how many Protobuf parser libraries have security bugs? I'm guessing very few because the standard is nice and simple, and it's very clearly defined without much "it's probably like this" wiggle room (much easier for binary formats!).

XML had a ton of unnecessary complexity that could have been avoided to make implementations simpler. I haven't actually read this bug so let's see if it was one of:

* Closing tags having to repeat the name / two different ways of closing tags.

* CDATA

* Namespaces (especially how they are defined)

* &entities;

Edit: Ha it wasn't any of those - but it was still an issue with text based formats. Seems like Expat assumes the content is valid UTF-8 (and doesn't validate it), while Gloox assumes it is ASCII. Obviously this couldn't have happened with binary formats.

If you care about security DON'T USE TEXT FORMATS!

bawolff 1486 days ago

XML is a bad text based format. It doesn't know if it wants to be human readable or computer readable so it does both poorly (if you think this vuln is bad, check out some of the SAML vulns).

I wouldn't blame xml's silliness on text based formats in general, even if they are full of risks.

salawat 1486 days ago

Wrong.

If you care about security, verify your goddamn invariants.

This is not a software problem. This is a lazy programmer/software engineer problem. Electrical Engineering, or hell, any mature engineering field understands this concept.

If you have not read your entire codepath, you have no idea what it is you are doing.

Welcome to why my life as a QA is effing miserable. Every bit of ignorance by devs following the philosophy of "abstraction is good" is dealt with at the level of Software BoM audit.

All hail Time to Market!

IshKebab 1486 days ago

> This is not a software problem. This is a lazy programmer/software engineer problem.

The old "good programmers don't write bugs" fallacy. How do so many people still think like this in 2022??

salawat 1485 days ago

There is a difference between not writing bugs, and checking your invariants.

If you have not read implementation code you are dependent on, you by definition have not had the signal that raises that invariant violation into your consciousness.

It would be like a civil engineer building a bridge out of limestone at a thickness that would require a larger thickness of steel and just saying "to hell with it, go figure it out".

The write-only programmer is a threat to themselves and everyone around them. And to be frank, even more dangerous are those members of management who have their expectations around implementation time so skewed by this cavalier attitude toward knowing the dynamics of your stack.

You will make bugs. Crossed invariants are completely preventable though.

KronisLV 1486 days ago

> If you care about security, verify your goddamn invariants.

While it would be nice to be able to do this, sadly we don't have infinite resources, lest we be okay with actually shipping software in 5-10 years instead of 1-2. I know that I would be okay with such a world, but people who pay my salary might not share that point of view. Nor do the people who would have to choose an app to use in the near future, instead of waiting for a decade to do so.

> This is not a software problem. This is a lazy programmer/software engineer problem. Electrical Engineering, or hell, any mature engineering field understands this concept.

The thing is, that the majority of the development out there is like the Wild West. If my code throws a NullPointerException or a NullReferenceException, then someone is going to be mildly annoyed and it might result in a Jira issue to fix. Code failing in a variety of ways is almost considered normal in some respects, outside of specific (expensive) contexts, where correctness matters a lot.

Admittedly, even in programming there are fields where the stakes are higher, though writing code for planes (as an example) is wildly different than what 90% of people out there would call "programming". Personally, I'd like 100% test coverage (lines, code branches, everything), but outside of these high stakes environments it would be wasteful to do so.

> If you have not read your entire codepath, you have no idea what it is you are doing.

For many out there, this is pretty much impossible to do in a meaningful way. Let's use something like the Spring framework, a popular option in Java for web dev, a stack that has a rather high level of abstraction. In it, the actual code path that you're dealing with would involve your application code, the framework code (which is likely many times longer than your actual application, uses reflection and other complex mechanisms, overall being truly Eldritch at times), any integrated libraries, as well as the JVM and some other code on your actual system, that interfaces with the JVM.

Even if you toss out Java from the stack, the actual hot code path in any non-trivial piece of software will be pretty difficult to reason about, due to different types of linking, different external package versions etc. Unless you feel okay with very, very slowly stepping through everything with a debugger, which probably still won't give you too good of an idea of what's actually happening and what should have happened.

Though maybe traversing 20 layers of abstraction in Spring and coming out of that debugging session more confused than you were when you entered it is just a Java/Spring thing, who knows.

> Welcome to why my life as a QA is effing miserable. Every bit of ignorance by devs following the philosophy of "abstraction is good" is dealt with at the level of Software BoM audit.

I think there's plenty of misery to be had all around. For a humorous take at the state of things, have a look at this article: https://www.stilldrinking.org/programming-sucks

> All hail Time to Market!

All hail being able to pay rent by delivering sub-optimal software to meet ever changing business demands in an environment where nobody wants to pay for perfect software. That's simply the world we live in, take it or leave it (e.g. pursue whichever environment feels better to you, within the bounds of your opportunities in life).

salawat 1485 days ago

And thus we come back to the age old quandary. The implicit act of economic violence tucked away into our current society.

I have capital, you don't, do what I want, or starve.

It always comes back to violence.

This is why we waste so much time reinventing things and white labelling, and subjecting other professions to the most ungodly tooling. We mass-produce suffering. We engineer it into the product in the form of lack of care under the guise of "we're innovating guyz".

KronisLV 1485 days ago

Ehh, that's a somewhat grim view, though I doubt I can offer many valuable points about the wider nature of capitalism as a whole.

That said, what I can say is that there definitely is a wide spectrum of different circumstances that people are dealing with and therefore the level of care that certain things will get will also vary.

For example, would it be cool to spend 2 decades working on the perfect GUI framework that'd be tested, dependable, performant and would also have exceedingly good usability? Sure. Is that going to happen in our current world? Perhaps not.

But hey, starting out with a bit of pushback and selling the concept of TDD or quality gates is a start as well, or even having proper tests for all of the important business logic, whilst willingly ignoring (putting off) the things that are just infeasible.

twoodfin 1486 days ago

This is just so basic a screwup though. The W3C spec for XML has had a formal syntactic description of valid tag names for decades:

https://www.w3.org/TR/2006/REC-xml11-20060816/#sec-common-sy...

Plenty of libraries get this right because it’s so easy. You’d almost have to try—probably by being “clever”—to get it wrong.
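For a sense of how small that syntactic description is, here is a Python sketch that transcribes the `NameStartChar` and `NameChar` ranges from the spec's `Name` production into a regular expression. The ranges are written from my reading of the spec and should be checked against it; `is_xml_name` is a made-up helper. It operates on already-decoded characters, so it says nothing about how the bytes were decoded.

```python
import re

# Character ranges for NameStartChar, as listed in the XML Name production.
_NAME_START = (
    ":A-Z_a-z"
    "\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF"
    "\u0370-\u037D\u037F-\u1FFF\u200C-\u200D"
    "\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF"
    "\uF900-\uFDCF\uFDF0-\uFFFD"
    "\U00010000-\U000EFFFF"
)
# NameChar adds '-', '.', digits, U+00B7 and two combining/connector ranges.
_NAME_CHAR = _NAME_START + "\\-.0-9\u00B7\u0300-\u036F\u203F-\u2040"

NAME_RE = re.compile(f"[{_NAME_START}][{_NAME_CHAR}]*")

def is_xml_name(s: str) -> bool:
    """True if s matches Name ::= NameStartChar (NameChar)*."""
    return NAME_RE.fullmatch(s) is not None
```

Under this check, names such as `ns:tag-1.x` pass, while `1abc`, `a b` and `a>b` are rejected.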

lucideer 1486 days ago

While I'm not defending the screw-up here - it's bad - it does do it a slight injustice to omit that the issue was not something simplistic around ascii/utf8 parsing but rather failing to reject/escape malformed-UTF8 strings. Unicode handling even in actual programming language implementations is an extremely common and well-documented problem.
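To illustrate why malformed UTF-8 is the dangerous case, here is a toy Python sketch. It is not the Expat or Gloox code and does not model either library; `trusting_scan` and `strict_scan` are hypothetical. The first scanner believes whatever length the lead byte implies and never checks the continuation bytes; the second rejects malformed input outright.

```python
def trusting_char_len(lead: int) -> int:
    """Sequence length implied by a UTF-8 lead byte, with no validation."""
    if lead >= 0xF0:
        return 4
    if lead >= 0xE0:
        return 3
    if lead >= 0xC0:
        return 2
    return 1

def trusting_scan(raw: bytes) -> list[bytes]:
    """Split input into 'characters' by trusting each lead byte.

    Continuation bytes are never checked, so a truncated sequence
    silently swallows whatever bytes follow it.
    """
    out, i = [], 0
    while i < len(raw):
        n = trusting_char_len(raw[i])
        out.append(raw[i:i + n])
        i += n
    return out

def strict_scan(raw: bytes) -> list[str]:
    """Reject malformed input via Python's strict UTF-8 decoder."""
    return list(raw.decode("utf-8", errors="strict"))

payload = b"ab\xe2<c>"  # 0xE2 announces a 3-byte sequence that never arrives
```

On `payload`, the trusting scanner folds `<c` into the bogus three-byte character and never sees a `<` of its own, while a byte-oriented consumer of the same data would see the tag `<c>`. The strict scanner raises `UnicodeDecodeError` instead of producing either reading.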

remus 1486 days ago

I think it's worth remembering that XML parsing is also a big historic source of bugs which suggests to me that while it may look simple and well formed on the surface it's probably a lot harder than it looks.

funcDropShadow 1486 days ago

Could you give examples? There were plenty of problems with certain standards layered atop of XML or self-made implementations of XML parsers and unparsers [1], but there is also a well tested set of standard compliant XML libraries that avoid those issues.

[1]: An internationally known consulting firm, that I won't name, had (perhaps has) an internal tool that compiles an Excel description of a service interface into actual XML parsing code that accepts only one hard-coded namespace alias for each given namespace. Over the years I've come across multiple companies with that bug in some service. Every time I looked into it, the cause was the same internal tool from that consulting firm. And I've repeatedly met people who had already discovered the same thing.

lucideer 1486 days ago

I have the same question as the sibling commenter: are you sure you mean parsing (i.e. well-formedness) and not handling (i.e. logic to do things with the parsed data, e.g. XXE, namespace separation, etc.)?

Obviously all software has some bugs and I'm sure XML parsers are no exception but I haven't been personally aware of any high profile ones before this.

For a quick example of a lowish-level XML bug that isn't parsing-related, I reported a bug many years ago in a piece of software whereby attributes without curie prefixes were being placed into the wrong namespace. A weird quirk of the XML spec is that unprefixed tags go into the default namespace but unprefixed attributes go into a "NULL" namespace (or, if I recall correctly, sometimes a specific namespace depending on the tag?). That's not a parser bug though since the parser has parsed the tag, attributes and associated prefix strings (or lack thereof) correctly: it just does something wrong post-parsing.
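The unprefixed-attribute behaviour can be observed with Python's standard `xml.etree.ElementTree`, which writes namespaced names as `{uri}local`. The document and the `urn:example:...` URIs below are invented for the demonstration, and this shows only how ElementTree reports names, not the software the bug was reported against.

```python
import xml.etree.ElementTree as ET

doc = (
    '<root xmlns="urn:example:default" xmlns:p="urn:example:p" '
    'plain="1" p:pref="2"/>'
)
root = ET.fromstring(doc)

# The unprefixed element picks up the default namespace...
print(root.tag)             # {urn:example:default}root

# ...but the unprefixed attribute does not: its key carries no namespace,
# while the prefixed attribute is qualified with its namespace URI.
print(sorted(root.attrib))  # ['plain', '{urn:example:p}pref']
```

Code that looks up `{urn:example:default}plain` would therefore find nothing, which is the kind of post-parsing mistake described above.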

I feel like that class of bug is very common with XML, but it's more of an application stability concern than a security one (XXE being a notable exception just because it deals with IO)

mwcampbell 1486 days ago

IMO the best response to this kind of analysis is to humbly realize that any of us, working under real-world pressures, could make such a screw-up, and contemplate how we'll remain vigilant and mitigate the damage that comes from our inevitable screw-ups.

mwcampbell 1486 days ago

I suppose it's safest to use a binary format where variable-length fields are prefixed with their length.

jerf 1486 days ago

Assuming properly-created data, yes. You aren't immune to problems but you will reduce them, especially in a memory-safe language.

Unfortunately, in a security context, that is not only not guaranteed, but will be actively attacked, so in practice I'm not sure it buys you that much from a security perspective. A net positive, I think, but certainly not enough that you can metaphorically kick back and enjoy your lemonade.

Length-prefixed binary formats gave us one of the oldest security vulnerabilities: simply claiming a length larger than the buffer allocated in the C program. I'm inclined to credit that particular joy to C and not the data format itself, though. Nowadays there aren't many languages where simply claiming to be really long will get you anywhere like that.

amluto 1486 days ago

More generally, if you want to include a block of untrustworthy structured data in a protocol, it’s very much preferable to do so in a way that does not require inspecting the data in question to figure out where it ends and thus where the outer protocol resumes.

English is not immune. Think about “who’s on first” — there is no way to distinguish the untrustworthy name “who” from a grammatical part of the conversation.

jandrese 1486 days ago

Sure, if you like ingesting 4GB records. There is nothing inherently safer in binary formats. It's easy to write parsers that can handle properly formatted files; it's when you're dealing with corrupt or malformed files that everything gets complicated.

teakettle42 1486 days ago

> There is nothing inherently safer in binary formats.

Sure there is. Barring a pathologically bad wire format design, they’re easier to parse than an equivalent human editable encoding.

Eliminating the human-editing ability requirement also enables us to:

- Avoid introducing character encoding — a huge problem space just on its own — into the list of things that all parsers must get right.

- Define non-malleable encodings; in other words, ensure that there exists only one valid encoding for any valid message, eliminating parser bugs that emerge around handling (or not) multiple different ways to encode the same thing.

userbinator 1486 days ago

> Define non-malleable encodings; in other words, ensure that there exists only one valid encoding for any valid message, eliminating parser bugs that emerge around handling (or not) multiple different ways to encode the same thing.

I've said similar things to this before. E.g. if you want a boolean, there's nothing simpler and less error-prone than a single bit. It represents exactly the values you need; nothing more and nothing less. You could take a byte if you didn't want to pack, and use the "0 is false, nonzero is true" convention, which is naturally usable in a lot of programming languages; that way there are 256 different values, but the set of inputs is still small and finite with each one having a defined interpretation.
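A small Python sketch contrasting the two ideas for a one-byte boolean; all three function names are made up. The "nonzero is true" convention gives every input a defined meaning but leaves 255 different encodings of `True`, whereas a non-malleable decoder accepts exactly one byte value per logical value.

```python
def decode_bool_lenient(b: int) -> bool:
    """'0 is false, nonzero is true': every byte value is accepted."""
    return b != 0

def decode_bool_canonical(b: int) -> bool:
    """Exactly one valid encoding per value; anything else is an error."""
    if b == 0x00:
        return False
    if b == 0x01:
        return True
    raise ValueError(f"non-canonical boolean byte: {b:#04x}")

def encode_bool(v: bool) -> int:
    """The single encoding that decode_bool_canonical accepts."""
    return 1 if v else 0
```

With the canonical decoder, decoding and re-encoding any accepted byte returns the same byte, so two parties cannot hold different byte strings for the same message. With the lenient one, `0x01` and `0x02` decode identically.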

ajsnigrutin 1486 days ago

Sure, until someone sets the prefix to 100MB large, and sends zero bytes of data :)

account42 1485 days ago

Which would be a lot easier to catch by bounds checks in the language / data types used / sanitizers / fuzzers / static analysis than cases like this where you can have two implementations seemingly successfully parse the data but disagree on the result.
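A sketch of that kind of bounds check in Python, assuming a made-up wire format of a 4-byte big-endian length followed by the payload. `read_field` and the 1 MiB `MAX_FIELD` cap are illustrative choices, not part of any real protocol.

```python
import struct

MAX_FIELD = 1 << 20  # assumed policy limit: 1 MiB per field

def read_field(buf: bytes, offset: int = 0) -> tuple[bytes, int]:
    """Read one field: 4-byte big-endian length, then that many bytes.

    Returns (payload, offset of the next field). The declared length is
    checked against both a fixed cap and the bytes actually present
    before anything is sliced or allocated.
    """
    if len(buf) - offset < 4:
        raise ValueError("truncated length prefix")
    (length,) = struct.unpack_from(">I", buf, offset)
    start = offset + 4
    if length > MAX_FIELD:
        raise ValueError("declared length exceeds limit")
    if length > len(buf) - start:
        raise ValueError("declared length exceeds available data")
    return buf[start:start + length], start + length
```

A prefix claiming 100 MB followed by zero bytes of data is rejected by the first length check, and the outer parser never has to look inside the payload to find where the next field begins.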

lmm 1486 days ago

Programmers respond to their incentives. Like most security bugs, this one happened because someone was dumb enough to use C for something connected to the internet. But the reason programmers do that is because of a culture that rewards fast and insecure more than slightly less fast and correct.