| HN Mirror


	by ygra 3731 days ago
	Except that U+FEFF is specified to be used as a byte-order indicator only when it appears at the start of a text stream, and there it does not belong to the text content. It does not corrupt data, because for any conforming application it's not part of the data. Non-conforming applications doing wrong things because they don't bother to follow the standard is hardly surprising. Yes, Unicode is messy and could have been better designed. It was designed so that there is an easy conversion path from any pre-existing encoding, so it concerned itself more with making it easy to convert content in legacy encodings to Unicode than with making it easy to implement applications that support Unicode. But it's still orders of magnitude better than anything that came before it when it comes to representing text in general. And it's mostly complicated because languages and scripts are complicated.
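To make the conforming behaviour concrete, here is a minimal Python sketch. It assumes UTF-8 input and uses Python's `utf-8-sig` codec as a stand-in for a BOM-aware reader; the sample string is made up for illustration.

```python
import codecs

# Hypothetical sample: a UTF-8 stream that starts with U+FEFF as a signature.
raw = codecs.BOM_UTF8 + "héllo".encode("utf-8")

# A reader that ignores the rule keeps U+FEFF as the first character.
naive = raw.decode("utf-8")

# A BOM-aware reader drops the leading signature before handing out text.
aware = raw.decode("utf-8-sig")

print(repr(naive))
print(repr(aware))
```

The two results differ only in the leading U+FEFF, which is exactly the part a conforming reader is not supposed to treat as content.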

1 comments

MichaelGG 3730 days ago

OK, then I cat two files together. Somehow thinking that one can unilaterally declare every application in the world as "non-conforming" is just sticking your head in the sand.
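A rough sketch of that scenario in Python, assuming both files are UTF-8 with a BOM. The byte strings stand in for the file contents and `+` stands in for what `cat` does:

```python
import codecs

# Hypothetical contents of two BOM-prefixed UTF-8 files.
file_a = codecs.BOM_UTF8 + "first\n".encode("utf-8")
file_b = codecs.BOM_UTF8 + "second\n".encode("utf-8")

# Byte-level concatenation, as cat would do it.
joined = file_a + file_b

# Even a BOM-aware decoder only strips the signature at the very start.
text = joined.decode("utf-8-sig")

print(repr(text))
```

The second file's U+FEFF is no longer at the start of the stream, so it survives as a character in the middle of the text.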

ygra 3729 days ago

Well, the notion that text can be safely treated and manipulated as binary data and still make sense is what doesn't work here. Binary concatenation is simply the wrong tool for that job. It's kinda the same as reversing text byte by byte and then complaining that all the diacritics are wrong.
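A small Python illustration of the reversal analogy, assuming UTF-8 and a made-up sample word that is spelled with a combining accent:

```python
# Hypothetical sample: "café" written as 'e' followed by U+0301 (combining acute).
text = "cafe\u0301"

# Reversing the encoded bytes splits the two-byte sequence for U+0301.
byte_reversed = text.encode("utf-8")[::-1]
try:
    byte_reversed.decode("utf-8")
    still_valid = True
except UnicodeDecodeError:
    still_valid = False

# Reversing code points stays decodable, but the accent loses its base letter.
codepoint_reversed = text[::-1]

print(still_valid)
print(repr(codepoint_reversed))
```

Neither result is a reversed "café": the first is not even valid UTF-8 any more, and the second starts with a combining mark that has nothing to attach to.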

Yes, cat comes from simpler times, but if cat cannot be changed for compatibility reasons, then it should no longer be used to concatenate text files, at least if the result is somehow important. Text and binary data are simply two very different things, and each needs to be processed accordingly. Sure, there are a bunch of other operations that are immediately recognisable as making no sense on text at all, and on the surface concatenating files is not one of them, but in my eyes that view is a bit shortsighted. You can safely use methods for binary data on text iff you know exactly what your text contains and that the operation is safe on it. Otherwise you may mangle things.
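For what it's worth, a text-aware concatenation does not have to be much code. The following is only a sketch: `concat_text` is a hypothetical helper, and it assumes every input is UTF-8 (with or without a BOM) and that the output should carry no BOM.

```python
import codecs


def concat_text(chunks):
    """Join UTF-8 byte chunks as text, dropping each chunk's leading BOM."""
    # Decode every chunk on its own so each leading U+FEFF is recognised
    # as a signature rather than carried over as content.
    texts = [chunk.decode("utf-8-sig") for chunk in chunks]
    # Re-encode the joined text as plain UTF-8 without a signature.
    return "".join(texts).encode("utf-8")


# Hypothetical contents of two BOM-prefixed UTF-8 files.
file_a = codecs.BOM_UTF8 + "first\n".encode("utf-8")
file_b = codecs.BOM_UTF8 + "second\n".encode("utf-8")

result = concat_text([file_a, file_b])
print(result)
```

The design choice here is to decode each input separately, which is precisely the knowledge about the contents that a byte-level tool does not have.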