by jerf 4012 days ago
Just checked, and \v is illegal in the characters of an XML document: http://www.w3.org/TR/REC-xml/#dt-text

But you should have gotten an error, of course, not the silent truncation you imply.

If you need to salvage the character, your XML library may let you specify it as &#x0b;. That is still a violation, but a lot of libraries seem to let it through: http://www.w3.org/TR/REC-xml/#sec-references (see "Well-formedness constraint: Legal Character"... you are specifically not allowed to use this to do what I'm suggesting here).
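A quick way to see what your own library does is to feed it both forms. This is only a sketch using Python's standard xml.etree.ElementTree (expat underneath); is_well_formed is a hypothetical helper name, and other libraries may well be more lenient:

```python
import xml.etree.ElementTree as ET


def is_well_formed(doc: str) -> bool:
    """Return True if the parser accepts doc, False if it raises ParseError."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False


# Raw vertical tab in character data.
raw_vt = "<a>x\x0by</a>"
# Vertical tab smuggled in as a hexadecimal character reference.
ref_vt = "<a>x&#x0b;y</a>"

print(is_well_formed(raw_vt))
print(is_well_formed(ref_vt))
print(is_well_formed("<a>x y</a>"))
```

With this particular parser both vertical-tab variants are rejected with an error rather than silently truncated, which is the behavior the spec calls for.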

Anyways, the moral here is that XML CANNOT carry arbitrary binary, and EVERY TIME you output something in XML, something in the system needs to run some sort of encoding & illegal-character cleaning pass on the output text. The moral equivalent of "<tag>$content</tag>" in your language is ALWAYS wrong, unless you specifically processed $content into XML character content earlier. This is true even when you're really sure $content is "safe". Even if you're right... and statistically speaking, you're not... do it correctly anyhow and call the right encoding function.
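For what it's worth, here is roughly what such a pass could look like in Python. It's a sketch, not anybody's official API: to_xml_text is a made-up name, and replacing illegal characters with U+FFFD (rather than dropping them or raising) is just one policy I picked for illustration. The character class is the XML 1.0 Char production.

```python
import re
from xml.sax.saxutils import escape

# Anything NOT matched by the XML 1.0 "Char" production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_ILLEGAL_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def to_xml_text(content: str, replacement: str = "\uFFFD") -> str:
    """Turn an arbitrary string into legal, escaped XML character content."""
    # Step 1: illegal-character cleaning (assumed policy: substitute U+FFFD).
    cleaned = _ILLEGAL_XML_CHARS.sub(replacement, content)
    # Step 2: escape markup-significant characters (&, <, >).
    return escape(cleaned)


content = "1 < 2 & a\x0bb"
print("<tag>" + to_xml_text(content) + "</tag>")
```

The point is that the string interpolation only ever sees the output of to_xml_text, never the raw $content.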

I've dealt with vertical tabs and linefeeds by just Base64-encoding character data that might include them before stuffing it into a CDATA node in the XML doc.

It's a hack, sure, having to encode/decode all the time, but if you need to store those characters, it's the only bulletproof way I've found.
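A minimal sketch of that round trip in Python, with some assumptions: wrap_payload, unwrap_payload, the payload element and the encoding attribute are all invented names, and since ElementTree doesn't write CDATA sections this puts the Base64 text in an ordinary text node (the Base64 alphabet contains nothing XML needs escaped, so it comes to the same thing).

```python
import base64
import xml.etree.ElementTree as ET


def wrap_payload(raw: bytes) -> str:
    """Serialize raw bytes as Base64 text inside a (hypothetical) <payload> element."""
    elem = ET.Element("payload", {"encoding": "base64"})
    elem.text = base64.b64encode(raw).decode("ascii")
    return ET.tostring(elem, encoding="unicode")


def unwrap_payload(xml_text: str) -> bytes:
    """Parse the element back and decode its Base64 text to the original bytes."""
    elem = ET.fromstring(xml_text)
    return base64.b64decode(elem.text or "")


original = b"line one\x0bline two\x0c"
doc = wrap_payload(original)
print(doc)
print(unwrap_payload(doc) == original)
```

The cost is the usual Base64 one: the payload grows by about a third and stops being human-readable in the document.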

I have to admit I'm still kind of split on whether XML made the right call here. It's tricky with character encodings to allow arbitrary binary in the characters, but something like CDATA could have permitted it, perhaps with a shell-like specification of a terminating byte sequence, or even with a UTF-8-style prefix number that indicates the length. This sounds great to me at first. But then I put on my security hat and consider what horrors would transpire in the bowels of programs that are unprepared to handle binary, or that can somehow be tricked by differences between validation and parsing, or any number of other nightmares one could cause with this, and I go back to neutral-at-best. (I'd go negative, but on the other, other hand [1], a lot of these things are already happening as people blithely stuff these things into XML documents anyhow, standard or no.)

[1]: No, not gripping hand... that's only for when the third choice is the dominant/default/obviously-correct-once-I-say-it choice.

Yep, that's a correct and fairly standard way of embedding binary data in XML. Base64 Encode.

Always makes me nostalgic for Usenet. Which, yes, was technically UUEncode back in the Usenet days, with some slight technical differences from Base64.