by jerf 1486 days ago
Unfortunately, the problem here is programmers more so than formats. It literally doesn't matter what you specify; programmers will not implement it to a T. Most programmers simply don't know that every single detail matters. Many of those who may have some idea don't really care, since they can't imagine how something like this could happen.

It's not just XML. It's every ecosystem I've ever used. Push it around the edges and you will find things.

This is neat, not because it is special to JSON in particular but because it's an example of examining a good chunk of a large ecosystem: https://seriot.ch/projects/parsing_json.html Consider that this is likely to be true of any ecosystem that doesn't make avoiding it a top priority.

4 comments

I disagree. The way the format is designed has a direct effect on how likely implementors are to implement it correctly. So the format designers bear some responsibility.

For example how many Protobuf parser libraries have security bugs? I'm guessing very few because the standard is nice and simple, and it's very clearly defined without much "it's probably like this" wiggle room (much easier for binary formats!).

XML had a ton of unnecessary complexity that could have been avoided to make implementations simpler. I haven't actually read this bug so let's see if it was one of:

* Closing tags having to repeat the name / two different ways of closing tags.

* CDATA

* Namespaces (especially how they are defined)

* &entities;

Edit: Ha it wasn't any of those - but it was still an issue with text based formats. Seems like Expat assumes the content is valid UTF-8 (and doesn't validate it), while Gloox assumes it is ASCII. Obviously this couldn't have happened with binary formats.
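To make that kind of disagreement concrete, here is a minimal sketch of two scanners that read the same bytes differently. Neither function is Expat's or Gloox's actual code; the function names and the payload are made up for illustration. One scanner trusts a UTF-8 lead byte's claimed sequence length without checking the continuation bytes, the other looks at every byte on its own. Strict validation up front rejects the input that makes them diverge.

```python
def seq_len(lead: int) -> int:
    """Sequence length claimed by a UTF-8 lead byte (no validation)."""
    if lead < 0x80:
        return 1
    if 0xC0 <= lead < 0xE0:
        return 2
    if 0xE0 <= lead < 0xF0:
        return 3
    if 0xF0 <= lead < 0xF8:
        return 4
    return 1  # stray continuation or invalid lead byte: treat as one unit


def openers_trusting_lead_byte(data: bytes) -> int:
    """Hypothetical scanner: skips whole 'characters' based on the lead byte."""
    count = 0
    i = 0
    while i < len(data):
        n = seq_len(data[i])
        if n == 1 and data[i] == ord("<"):
            count += 1
        i += n  # bytes swallowed here are never inspected
    return count


def openers_bytewise(data: bytes) -> int:
    """Hypothetical scanner: treats every byte as its own character."""
    return data.count(b"<")


def is_valid_utf8(data: bytes) -> bool:
    """Strict check: Python's decoder rejects malformed sequences."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# 0xE0 claims a 3-byte sequence, but the next two bytes are '<' and 'b'.
payload = b"<a>\xe0<b></a>"
```

On the payload above the two scanners count a different number of tag openers, which is the precondition for smuggling markup past one of them. Calling `is_valid_utf8` first and refusing the input removes the ambiguity before either scanner runs.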

If you care about security DON'T USE TEXT FORMATS!

XML is a bad text based format. It doesn't know if it wants to be human readable or computer readable so it does both poorly (if you think this vuln is bad, check out some of the SAML vulns).

I wouldn't blame xml's silliness on text based formats in general, even if they are full of risks.

Wrong.

If you care about security, verify your goddamn invariants.

This is not a software problem. This is a lazy programmer/software engineer problem. Electrical Engineering, or hell, any mature engineering field understands this concept.

If you have not read your entire codepath, you have no idea what it is you are doing.

Welcome to why my life as a QA is effing miserable. Every bit of ignorance by devs following the philosophy of "abstraction is good" is dealt with at the level of Software BoM audit.

All hail Time to Market!

> This is not a software problem. This is a lazy programmer/software engineer problem.

The old "good programmers don't write bugs" fallacy. How do so many people still think like this in 2022??

There is a difference between not writing bugs, and checking your invariants.

If you have not read the implementation code you depend on, you by definition have not had the signal that raises that invariant violation into your consciousness.

It would be like a civil engineer building a bridge out of limestone at a thickness that would require a larger thickness of steel and just saying "to hell with it, go figure it out".

The write-only programmer is a threat to themselves and everyone around them. And to be frank, even more dangerous are those members of management who have their expectations around implementation time so skewed by this cavalier attitude toward knowing the dynamics of your stack.

You will make bugs. Crossed invariants are completely preventable though.

> If you care about security, verify your goddamn invariants.

While it would be nice to be able to do this, sadly we don't have infinite resources, unless we're okay with shipping software in 5-10 years instead of 1-2. I know that I would be okay with such a world, but the people who pay my salary might not share that point of view. Nor would the people who have to choose an app to use in the near future, instead of waiting a decade to do so.

> This is not a software problem. This is a lazy programmer/software engineer problem. Electrical Engineering, or hell, any mature engineering field understands this concept.

The thing is that the majority of the development out there is like the Wild West. If my code throws a NullPointerException or a NullReferenceException, then someone is going to be mildly annoyed and it might result in a Jira issue to fix. Code failing in a variety of ways is almost considered normal in some respects, outside of specific (expensive) contexts where correctness matters a lot.

Admittedly, even in programming there are fields where the stakes are higher, though writing code for planes (as an example) is wildly different than what 90% of people out there would call "programming". Personally, I'd like 100% test coverage (lines, code branches, everything), but outside of these high stakes environments it would be wasteful to do so.

> If you have not read your entire codepath, you have no idea what it is you are doing.

For many out there, this is pretty much impossible to do in a meaningful way. Let's use something like the Spring framework, a popular option in Java for web dev, a stack that has a rather high level of abstraction. In it, the actual code path that you're dealing with would involve your application code, the framework code (which is likely many times longer than your actual application, uses reflection and other complex mechanisms, overall being truly Eldritch at times), any integrated libraries, as well as the JVM and some other code on your actual system, that interfaces with the JVM.

Even if you toss out Java from the stack, the actual hot code path in any non-trivial piece of software will be pretty difficult to reason about, due to different types of linking, different external package versions etc. Unless you feel okay with very, very slowly stepping through everything with a debugger, which probably still won't give you too good of an idea of what's actually happening and what should have happened.

Though maybe traversing 20 layers of abstraction in Spring and coming out of that debugging session more confused than you were when you entered it is just a Java/Spring thing, who knows.

> Welcome to why my life as a QA is effing miserable. Every bit of ignorance by devs following the philosophy of "abstraction is good" is dealt with at the level of Software BoM audit.

I think there's plenty of misery to be had all around. For a humorous take at the state of things, have a look at this article: https://www.stilldrinking.org/programming-sucks

> All hail Time to Market!

All hail being able to pay rent by delivering sub-optimal software to meet ever changing business demands in an environment where nobody wants to pay for perfect software. That's simply the world we live in, take it or leave it (e.g. pursue whichever environment feels better to you, within the bounds of your opportunities in life).

And thus we come back to the age-old quandary. The implicit act of economic violence tucked away into our current society.

I have capital, you don't, do what I want, or starve.

It always comes back to violence.

This is why we waste so much time reinventing things and white labelling, and subjecting other professions to the most ungodly tooling. We mass-produce suffering. We engineer it into the product in the form of lack of care under the guise of "we're innovating guyz".

Ehh, that's a somewhat grim view, though I doubt I can offer many valuable points about the wider nature of capitalism as a whole.

That said, what I can say is that there definitely is a wide spectrum of different circumstances that people are dealing with and therefore the level of care that certain things will get will also vary.

For example, would it be cool to spend 2 decades working on the perfect GUI framework that'd be tested, dependable, performant and would also have exceedingly good usability? Sure. Is that going to happen in our current world? Perhaps not.

But hey, starting out with a bit of pushback and selling the concept of TDD or quality gates is a start as well, or even having proper tests for all of the important business logic, whilst willingly ignoring (putting off) the things that are just infeasible.

This is just so basic a screwup though. The W3C spec for XML has had a formal syntactic description of valid tag names for decades:

https://www.w3.org/TR/2006/REC-xml11-20060816/#sec-common-sy...

Plenty of libraries get this right because it’s so easy. You’d almost have to try—probably by being “clever”—to get it wrong.
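For a sense of how small that grammar is, here is a rough sketch of a name check written as a regular expression. The character ranges are transcribed from the NameStartChar and NameChar productions as I remember them, so treat them as an assumption and check them against the spec before relying on this. `is_xml_name` is a made-up helper name, not part of any library.

```python
import re

# Ranges for NameStartChar (assumed transcription of the spec's production).
_NAME_START = (
    r":A-Z_a-z"
    r"\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF"
    r"\u0370-\u037D\u037F-\u1FFF\u200C-\u200D"
    r"\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF"
    r"\uF900-\uFDCF\uFDF0-\uFFFD"
    r"\U00010000-\U000EFFFF"
)

# NameChar adds these to NameStartChar.
_NAME_EXTRA = r"\-.0-9\u00B7\u0300-\u036F\u203F-\u2040"

_NAME_RE = re.compile(f"[{_NAME_START}][{_NAME_START}{_NAME_EXTRA}]*")


def is_xml_name(text: str) -> bool:
    """True if the whole string matches the Name production sketched above."""
    return _NAME_RE.fullmatch(text) is not None
```

Note that this works on already-decoded text. It says nothing about whether the bytes underneath were valid UTF-8 in the first place, which is a separate check.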

While I'm not defending the screw-up here - it's bad - it does do it a slight injustice to omit that the issue was not something simplistic around ascii/utf8 parsing but rather failing to reject/escape malformed-UTF8 strings. Unicode handling even in actual programming language implementations is an extremely common and well-documented problem.

I think it's worth remembering that XML parsing is also a big historic source of bugs which suggests to me that while it may look simple and well formed on the surface it's probably a lot harder than it looks.

Could you give examples? There were plenty of problems with certain standards layered atop of XML or self-made implementations of XML parsers and unparsers [1], but there is also a well tested set of standard compliant XML libraries that avoid those issues.

[1]: An internationally known consulting firm, that I won't name, had (perhaps has) an internal tool that compiles an Excel description of a service interface into actual XML parsing code that accepts only one hard-coded namespace alias for each given namespace. Over the years I've come across multiple companies with that bug in some service. Every time I looked into it, the cause was that same internal tool of that consulting firm. And I've repeatedly met people who had already discovered the same thing.

I have the same question as the sibling commenter: are you sure you mean parsing (i.e. well-formedness) and not handling (i.e. logic to do things with the parsed data, e.g. XXE, namespace separation, etc.)?

Obviously all software has some bugs and I'm sure XML parsers are no exception but I haven't been personally aware of any high profile ones before this.

For a quick example of a lowish-level XML bug that isn't parsing-related, I reported a bug many years ago in a piece of software whereby attributes without curie prefixes were being placed into the wrong namespace. A weird quirk of the XML spec is that unprefixed tags go into the default namespace but unprefixed attributes go into a "NULL" namespace (or, if I recall correctly, sometimes a specific namespace depending on the tag?). That's not a parser bug though since the parser has parsed the tag, attributes and associated prefix strings (or lack thereof) correctly: it just does something wrong post-parsing.
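The tag/attribute asymmetry is easy to see with Python's standard `xml.etree.ElementTree`, which writes names as `{namespace}local`. This is only a small illustration; the namespace URIs and the `names_of` helper are made up for the example.

```python
import xml.etree.ElementTree as ET

DOC = (
    '<item xmlns="urn:example:default" '
    'xmlns:p="urn:example:prefixed" '
    'plain="1" p:marked="2"/>'
)


def names_of(xml_text: str):
    """Return the expanded tag name and the sorted expanded attribute names."""
    root = ET.fromstring(xml_text)
    return root.tag, sorted(root.attrib)


tag, attrs = names_of(DOC)
# tag   -> "{urn:example:default}item"   (unprefixed tag: default namespace)
# attrs -> ["plain", "{urn:example:prefixed}marked"]
#          (unprefixed attribute: no namespace at all)
```

Code that looks the unprefixed attribute up under the element's namespace gets nothing back, which is the post-parsing mistake described above.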

I feel like that class of bug is very common with XML, but it's more of an application stability concern than a security one (XXE being a notable exception just because it deals with IO)

IMO the best response to this kind of analysis is to humbly realize that any of us, working under real-world pressures, could make such a screw-up, and contemplate how we'll remain vigilant and mitigate the damage that comes from our inevitable screw-ups.

I suppose it's safest to use a binary format where variable-length fields are prefixed with their length.

Assuming properly-created data, yes. You aren't immune to problems but you will reduce them, especially in a memory-safe language.

Unfortunately, in a security context, that is not only not guaranteed, but will be actively attacked, so in practice I'm not sure it buys you that much from a security perspective. A net positive, I think, but certainly not enough that you can metaphorically kick back and enjoy your lemonade.

Length-prefixed binary formats are behind one of the oldest security vulnerabilities: simply claiming a length larger than the buffer allocated in the C program. I'm inclined to credit that particular joy to C and not the data itself, though. Nowadays there aren't many languages where simply claiming to be really long will get you anywhere like that.
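As a rough sketch of what checking the claimed length looks like in a memory-safe language, here is a reader for one length-prefixed field. The wire layout (4-byte big-endian length, then the payload), the 1 MiB cap, and the names `read_field` and `FrameError` are all assumptions made up for the example, not any real protocol.

```python
import struct
from typing import Tuple

MAX_FIELD_LEN = 1 << 20  # assumed policy: refuse fields larger than 1 MiB


class FrameError(ValueError):
    """Raised when the input does not match the assumed framing."""


def read_field(buf: bytes, offset: int = 0) -> Tuple[bytes, int]:
    """Read one field at offset; return (payload, offset of the next field)."""
    if offset < 0 or len(buf) - offset < 4:
        raise FrameError("truncated length prefix")
    (length,) = struct.unpack_from(">I", buf, offset)
    if length > MAX_FIELD_LEN:
        # Reject before allocating or waiting for data that may never come.
        raise FrameError("claimed length exceeds cap")
    start = offset + 4
    end = start + length
    if end > len(buf):
        raise FrameError("claimed length exceeds available data")
    return buf[start:end], end
```

A sender that claims a huge length and sends nothing is refused at the cap check, and a claim that runs past the end of the buffer is refused instead of being read. The outer parser never has to look inside the payload to find where it ends.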

More generally, if you want to include a block of untrustworthy structured data in a protocol, it’s very much preferable to do so in a way that does not require inspecting the data in question to figure out where it ends and thus where the outer protocol resumes.

English is not immune. Think about “who’s on first” — there is no way to distinguish the untrustworthy name “who” from a grammatical part of the conversation.

Sure, if you like ingesting 4GB records. There is nothing inherently safer in binary formats. It's easy to write parsers that can handle properly formatted files; it is when you're dealing with corrupt or malformed files that everything gets complicated.

> There is nothing inherently safer in binary formats.

Sure there is. Barring a pathologically bad wire format design, they’re easier to parse than an equivalent human editable encoding.

Eliminating the human-editing ability requirement also enables us to:

- Avoid introducing character encoding — a huge problem space just on its own — into the list of things that all parsers must get right.

- Define non-malleable encodings; in other words, ensure that there exists only one valid encoding for any valid message, eliminating parser bugs that emerge around handling (or not) multiple different ways to encode the same thing.

> Define non-malleable encodings; in other words, ensure that there exists only one valid encoding for any valid message, eliminating parser bugs that emerge around handling (or not) multiple different ways to encode the same thing.

I've said similar things to this before. E.g. if you want a boolean, there's nothing simpler and less error-prone than a single bit. It represents exactly the values you need; nothing more and nothing less. You could take a byte if you didn't want to pack, and use the "0 is false, nonzero is true" convention, which is naturally usable in a lot of programming languages; that way there are 256 different values, but the set of inputs is still small and finite with each one having a defined interpretation.
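A small sketch of the two byte conventions side by side, with made-up function names. The "nonzero is true" decoder gives every one of the 256 inputs a defined meaning, but 255 different bytes decode to true. A decoder that accepts only 0 and 1 is the non-malleable variant, where each value has exactly one encoding.

```python
def encode_bool(value: bool) -> int:
    """Encode a boolean as a single byte value: 0 or 1."""
    return 1 if value else 0


def decode_bool_nonzero(byte: int) -> bool:
    """'0 is false, nonzero is true': total over 0..255, but malleable."""
    if not 0 <= byte <= 255:
        raise ValueError("not a byte")
    return byte != 0


def decode_bool_canonical(byte: int) -> bool:
    """Accept only the single canonical encoding of each value."""
    if byte == 0:
        return False
    if byte == 1:
        return True
    raise ValueError("non-canonical boolean byte")


# How many distinct bytes decode to True under each convention.
true_encodings_nonzero = sum(decode_bool_nonzero(b) for b in range(256))
```

Which one to pick depends on whether two parties ever compare, hash, or sign the encoded bytes. If they do, the canonical decoder avoids two byte strings that mean the same thing.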

Sure, until someone sets the prefix to 100MB large, and sends zero bytes of data :)

Which would be a lot easier to catch by bounds checks in the language / data types used / sanitizers / fuzzers / static analysis than cases like this where you can have two implementations seemingly successfully parse the data but disagree on the result.

Programmers respond to their incentives. Like most security bugs, this one happened because someone was dumb enough to use C for something connected to the internet. But the reason programmers do that is because of a culture that rewards fast and insecure more than slightly less fast and correct.