by svnpenn 1424 days ago
> You need to know the encoding of any text, otherwise it’s impossible to decipher the message (although it’s common for applications to assume the encoding).

Even with the caveat in parentheses, this is quite misleading. For example, the following line is some text, with no specified encoding:

> hello world

Now, while it's true this could be some exotic encoding, or maybe just random binary data, I wouldn't call it impossible to decipher. More accurate would be "impossible to decipher with 100% certainty". The same issue exists with Protocol Buffers, or any format that is not self-describing. The data is not a black box, it's just annoying to deal with.

2 comments

Right. Impossible might have been an exaggeration, I will fix that. The point is that if you're reading a file with the text "hello world", you can only make out the characters because you know the encoding. Given two completely different encodings that map the same hex values in the message it would be impossible to determine which is the correct string. There is no such thing as plain text.
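
To make the ambiguity concrete, here is a minimal Python sketch. The sample word and the choice of cp1251 versus latin-1 are my own assumptions for illustration, not something taken from the article. The same six bytes decode without error under both encodings and give two different strings:

```python
# Illustrative only: the word and both encodings are arbitrary choices.
raw = "привет".encode("cp1251")  # six bytes, all >= 0x80

as_cp1251 = raw.decode("cp1251")   # Cyrillic reading
as_latin1 = raw.decode("latin-1")  # Western European reading

print(raw.hex(" "))  # ef f0 e8 e2 e5 f2
print(as_cp1251)     # привет
print(as_latin1)     # ïðèâåò
```

Neither decode raises an error, so nothing in the bytes themselves says which reading was intended.
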
> Given two completely different encodings that map the same hex values in the message it would be impossible to determine which is the correct string.

Sorry, but I don't agree with this either. You can, as a human being (or smart enough AI), look at the result in both encodings, and make an educated guess as to which is correct. If they are wholly different as you say, then one should be gibberish, and one should map to some dictionary.
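
A rough sketch of that "educated guess" in Python, assuming a toy word list and a hand-picked candidate list (both hypothetical; `DICTIONARY`, `score` and `guess_encoding` are names I made up, and a real tool would use a proper dictionary or character statistics). It decodes the bytes under each candidate encoding and keeps the one whose output looks most like known words:

```python
# Hypothetical stand-in for a real dictionary.
DICTIONARY = {"hello", "world"}

def score(text):
    """Fraction of whitespace-separated tokens found in DICTIONARY."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(w in DICTIONARY for w in words) / len(words)

def guess_encoding(raw, candidates):
    """Return (encoding, score) for the best-scoring candidate."""
    best, best_score = None, -1.0
    for enc in candidates:
        try:
            text = raw.decode(enc)
        except UnicodeDecodeError:
            continue  # bytes are not valid in this encoding at all
        s = score(text)
        if s > best_score:
            best, best_score = enc, s
    return best, best_score

raw = "hello world".encode("utf-16-le")
print(guess_encoding(raw, ["utf-8", "latin-1", "utf-16-le"]))
```

This is still a guess, not a proof: it only ranks the candidates you thought to try, and a tie or a short message can easily fool it.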

What you are feeling is called cognitive dissonance. You have the idea of text so hammered into your mind, that when you realize it's merely a convention that makes it readable in practice without needing to know the encoding, you cannot even concede, despite this being an obvious truth. This phenomenon is called "un*x braindamage".

More or less all possible interpretations of what this person said are correct. But UN*X braindamage also comes with Dunning-Kruger, because after many years you've memorized so many factoids that you think someone doesn't know what they're talking about when they get one wrong, despite their overall idea being correct.

The only reason it's easy to decode (as in casually, without requiring information-theoretic techniques or something like file(1)) is that almost all popular character encodings have gone far out of their way to map the first 128 byte values to ASCII. This idea that text is the common medium / lowest common denominator is a misconception, and it is why UN*X is buggy and half working. Just because it appears easy to read in common tools doesn't mean you have a correct semantic understanding of it. Text is also inefficient and leads to escaping problems whereby it becomes unreadable again.
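
The ASCII-compatibility point can be checked with a short Python sketch. The list of encodings is my own selection, and `is_ascii_compatible` is a hypothetical helper name; the test is simply whether bytes 0x00 through 0x7F decode to the same characters as they do in ASCII:

```python
LOW_BYTES = bytes(range(128))
REFERENCE = LOW_BYTES.decode("ascii")

def is_ascii_compatible(encoding):
    """True if bytes 0x00..0x7F decode exactly as they do in ASCII."""
    try:
        return LOW_BYTES.decode(encoding) == REFERENCE
    except UnicodeDecodeError:
        return False

candidates = ["utf-8", "latin-1", "cp1252", "cp1251", "koi8_r",
              "utf-16-le", "cp037"]
ascii_compatible = [enc for enc in candidates if is_ascii_compatible(enc)]
print(ascii_compatible)

# cp037 is an EBCDIC code page, so the same bytes read as something else.
print(b"hello world".decode("cp037") == "hello world")
```

The first five candidates pass; UTF-16 and the EBCDIC code page do not, which is the case where "hello world" stops being readable without knowing the encoding.
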
I have no idea what you mean by "this idea that text is the common medium / lowest common denominator", nor what these "escaping problems" are (I have to escape ASCII codes 0x00 through 0x1f, I guess, but it's unclear to me why that makes the result unreadable, especially since I hardly ever have to escape anything but \n and maybe \t). And the claim that "UN*X is buggy and half working" is just bizarre.
The UN*X mantra is that text is the common medium and data should be transferred as plain text, as opposed to any other way of encoding data structures like binary.

Escaping problems as in, you embed data structures into text via JSON or XML, and have to write \uXXXX and \" etc, making it unreadable once again.
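
A small Python sketch of what that looks like, using the standard `json` module with its default `ensure_ascii=True`. The payload string is an arbitrary example of mine, and nesting the JSON a second time is an assumption about how such data typically gets embedded:

```python
import json

payload = 'He said "привет"\n'

once = json.dumps(payload)   # string embedded in JSON
twice = json.dumps(once)     # that JSON embedded in JSON again

print(once)
print(twice)

# Both layers still round-trip back to the original string.
assert json.loads(once) == payload
assert json.loads(json.loads(twice)) == payload
```

The data survives intact, but every level of embedding adds another layer of backslashes, so the human-readable form degrades quickly.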

No, the claim that UN*X is stable is bizarre.