Hacker News
by colomon 4882 days ago
It seems to me it is a very valid principle in many areas.

For instance, the STEP file standard very clearly states that all input files must be 7-bit ASCII. Many of the programs that generate these files (including earlier versions of my own) paid no attention to this and wrote out 8-bit values in strings if the user requested it. Clearly this behavior is wrong. (The principle agrees: "Be conservative in what you do.")

However, rejecting an entire CAD file merely because the text strings in it used an illegal encoding is downright silly. It in no way can change the meaning of the geometry of the file. There is no hidden vector in there for malicious attacks. It makes perfect sense to accept illegal files like this and do your best to make them work, even if it might not get quite the same text strings the user intended.

I think jbert's point about being conservative in what you do in all respects is a strong one. Taking that into account suggests that maybe carefully marking the illegal character as such in the string might well be worthwhile, and is definitely more appropriate than trying to guess what 8-bit character standard was intended.

3 comments

That's exactly the kind of security risk that the article is talking about. Internet Explorer could be tricked into using US-ASCII encoding and interpreting ¼script¾ as a script tag (CVE-2006-3227).
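
To make the ¼script¾ case concrete, here is a rough Python sketch of the mismatch. It is only an illustration of the class of bug, not a reproduction of what IE actually did internally; the "filter" and "consumer" are hypothetical stand-ins, and clearing the high bit is assumed as the lenient consumer's behavior.

```python
# The bytes a forum filter receives: "¼script¾..." encoded as Latin-1.
payload = "¼script¾alert(1)¼/script¾".encode("latin-1")

# Hypothetical filter: decodes as Latin-1 and looks for angle brackets.
as_filter_sees_it = payload.decode("latin-1")
filter_says_safe = "<" not in as_filter_sees_it and ">" not in as_filter_sees_it

# Hypothetical lenient consumer: treats the data as 7-bit ASCII by
# clearing the 8th bit of every byte (0xBC -> 0x3C '<', 0xBE -> 0x3E '>').
as_consumer_sees_it = bytes(b & 0x7F for b in payload).decode("ascii")

print(filter_says_safe)     # True
print(as_consumer_sees_it)  # <script>alert(1)</script>
```

Neither program is wrong by its own rules. The hole only appears because the two apply different rules to the same bytes.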

Liberal vs strict is a false dichotomy. The third solution is to accept all possible inputs, but in a specified way.

Instead of taking the draconian XML approach, you can solve the problem by taking the HTML5 approach and making error handling as interoperable as the handling of correct input. In the case of STEP files, you could require all implementations to clear the 8th bit (or drop or clamp bytes out of range; whatever, as long as it's specified and mandatory).
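
A minimal sketch of the three options mentioned above, in Python. The function name and policy names are made up for illustration; nothing like this is in the STEP standard as far as I know.

```python
def sanitize_to_ascii(data: bytes, policy: str) -> bytes:
    """Map arbitrary bytes to 7-bit ASCII using one fixed, named policy."""
    if policy == "clear":
        # Clear the 8th bit of every byte.
        return bytes(b & 0x7F for b in data)
    if policy == "drop":
        # Discard every byte that has the 8th bit set.
        return bytes(b for b in data if b < 0x80)
    if policy == "clamp":
        # Replace every out-of-range byte with the highest 7-bit value.
        return bytes(min(b, 0x7F) for b in data)
    raise ValueError(f"unknown policy: {policy!r}")


raw = b"caf\xe9"  # "café" as Latin-1 bytes
print(sanitize_to_ascii(raw, "clear"))  # b'cafi'
print(sanitize_to_ascii(raw, "drop"))   # b'caf'
print(sanitize_to_ascii(raw, "clamp"))  # b'caf\x7f'
```

The three policies give three different answers for the same input, which is the point: it matters less which one is picked than that every implementation is required to pick the same one.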

Maybe I'm missing something here, but a valid STEP string can already encode any arbitrary Unicode code point. It just does it using 7-bit ASCII. If your code is somehow executing these strings without examining their content, then you are already in big, big trouble.

Trying to do something with 8-bit characters -- whether skipping them, indicating an illegal character in the string, or trying to guess what was really meant -- cannot make that situation any worse.

The problem is if you decode a particular byte sequence that causes a bad action (if that's possible with STEP files) in a different way than some other program that is supposed to keep you safe.

In the case of IE, IE decoded one way and forum software might decode a different way. So the forum software says the string is safe for the browser (according to its decoding rules), but then the browser applies different rules and gets a bad string.

You may not be seeing the danger because you implicitly think a STEP file from unsafe sources is always unsafe. But imagine if you had a safe file detector program, except it applied different rules than the program you're actually going to open the file with.

As jbert pointed out, if your program's main job is to say whether or not something is safe, and it liberally says "Oh yeah, I think that's safe", that's pretty much the exact opposite of "be conservative in what you do".
Please explain the proper way of escaping/rejecting HTML in forum posts, when you can't rely on the browsers following the spec.
Possible attack: Because the strings are not ASCII, implementations now need to bring another library in to decode those strings. Now let's say someone encodes an end-string char (single quote?) using some alternative encoding that doesn't use the ASCII quote char.

When an implementation saves this file, it normalizes that other encoding to use an ASCII single quote, then proceeds to write out the rest of the string. This isn't caught inside the implementation, because the encoding library only normalizes when writing. When it reads the data in, it still just represented it as bytes, and there was no ASCII single quote byte until the end of the dangerous string.
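
Here is one way that could play out, sketched in Python. The attack above doesn't name a mechanism, so I'm assuming Unicode NFKC normalization as the "normalize on write" step and the fullwidth apostrophe (U+FF07) as the alternative quote; the reader check and the `EVIL` entity are made up.

```python
import unicodedata


def looks_like_one_string(raw: str) -> bool:
    """Hypothetical reader check: quoted, with no ASCII quote inside."""
    return (
        len(raw) >= 2
        and raw.startswith("'")
        and raw.endswith("'")
        and "'" not in raw[1:-1]
    )


def write_normalized(raw: str) -> str:
    """Hypothetical writer: normalizes only on output (assumed NFKC)."""
    return unicodedata.normalize("NFKC", raw)


# On input there is no ASCII quote until the very end of the string.
original = "'harmless\uFF07,#1=EVIL(\uFF07x'"
written = write_normalized(original)

print(looks_like_one_string(original))  # True
print(written)                          # 'harmless',#1=EVIL('x'
print(looks_like_one_string(written))   # False
```

NFKC maps the fullwidth apostrophe to a plain ASCII one, so the file that gets written out contains string terminators the reader never saw.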

So, yes, it's possible that even something as simple as "string encoding" could be used to implement an attack.

But this is where "be conservative in what you do" comes into play. The STEP format has formal rules for exporting all ASCII, Unicode, and ISO-8859 characters. A well-written STEP string exporter should handle them all without difficulty, no matter what goofy things are in the string.
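
For illustration, a rough Python sketch of such an exporter. It assumes the ISO 10303-21 string rules as I understand them: apostrophes and backslashes are doubled, `\X2\...\X0\` wraps 16-bit code points in hex, and `\X4\...\X0\` wraps 32-bit ones. It skips the `\S\` and `\P` ISO-8859 directives and routes control characters through `\X2\` for simplicity, so check the standard before relying on it.

```python
def encode_step_string(text: str) -> str:
    """Encode text as a 7-bit ASCII STEP string literal (sketch only)."""
    out = []
    for ch in text:
        cp = ord(ch)
        if ch == "'":
            out.append("''")        # apostrophe is doubled
        elif ch == "\\":
            out.append("\\\\")      # backslash is doubled
        elif 0x20 <= cp <= 0x7E:
            out.append(ch)          # printable ASCII passes through
        elif cp <= 0xFFFF:
            out.append("\\X2\\%04X\\X0\\" % cp)   # 4 hex digits
        else:
            out.append("\\X4\\%08X\\X0\\" % cp)   # 8 hex digits
    return "'" + "".join(out) + "'"


print(encode_step_string("Ø5 'ref'"))  # '\X2\00D8\X0\5 ''ref'''
```

Whatever goes in, only 7-bit ASCII comes out, and any quote in the input is escaped rather than left to terminate the string.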

And again, if you're worried that there may be an attack vector, change high-bit-set characters to "[Illegal character value N]". Though it might be more merciful to assume they just wanted ISO-8859-1 characters and substitute the appropriate control code.
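
Both options are a few lines each. A quick Python sketch, with made-up function names:

```python
def mark_illegal_bytes(data: bytes) -> str:
    """Keep 7-bit bytes; replace each high-bit byte with a visible marker."""
    parts = []
    for b in data:
        if b < 0x80:
            parts.append(chr(b))
        else:
            parts.append(f"[Illegal character value {b}]")
    return "".join(parts)


def assume_latin1(data: bytes) -> str:
    """The 'merciful' option: guess that the writer meant ISO-8859-1."""
    return data.decode("latin-1")


raw = b"\xd85mm"  # a diameter sign written as a raw 8-bit byte
print(mark_illegal_bytes(raw))  # [Illegal character value 216]5mm
print(assume_latin1(raw))       # Ø5mm
```

The first never guesses, so it can't be steered into producing a character the writer didn't literally put in the file. The second gives nicer text when the guess is right.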

The tl;dr of the article is to define handling of invalid input, so that all conforming implementations will handle it in the same way, without having to reverse-engineer each other to be interoperable.
So you're saying that every time I find a STEP file written in an invalid fashion, I should convene an ISO 10303 committee and wait for years to find out how everyone should handle it? That's doubly insane, because it would take many bugs that can be fixed in a day and make my customers suffer from them for years, while at the same time requiring me to modify my program to handle every bug found by every STEP software vendor or cease to be conforming.
If the penalty for generating a CAD file with its strings in the wrong encoding is that no importer will read it because they're being strict in what they accept, then no exporter that does so will last very long in the wild.