by teddyh 2810 days ago
> The latin1 character set has a code point for every byte value

No it doesn’t. The whole range of 128-159 are undefined. However, the old MS-DOS CP-437 encoding (which is incompatible with latin1/ISO-8859-1) does. So your trick is valid, but not with latin1.

3 comments

I can’t edit my post now, but it turns out I was wrong. The range 128-159 is defined in ISO-8859-1, as little-used “control characters”:

https://en.wikipedia.org/wiki/C0_and_C1_control_codes#C1_set

So, the trick described by garethrees does work with latin1, and I was mistaken.

> No it doesn’t. The whole range of 128-159 are undefined.

Not in the sense of being decodable. If you decode a byte string with Latin-1, you get a Unicode string containing code points 0-255 only, each code point matching exactly the numerical value of the corresponding byte in the byte string. So you can recover exactly the original byte string by re-encoding. Plus, every possible byte string is valid for decoding in Latin-1, so you will never get any decode/encode errors. As long as you don't care about the semantic meaning of bytes 128-255, this allows you to preserve the data while still working with Unicode strings.
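
To make that concrete, here is a minimal Python sketch of the round trip, assuming Python's built-in "latin-1" codec (the variable names are just for illustration):

```python
# Every possible byte value, 0 through 255.
raw = bytes(range(256))

# Decoding as Latin-1 never fails: each byte becomes the code point
# with the same numeric value.
text = raw.decode("latin-1")
assert [ord(ch) for ch in text] == list(range(256))

# Re-encoding recovers the original bytes exactly.
roundtrip = text.encode("latin-1")
assert roundtrip == raw
```

The same holds for any byte string, not just this one, since each byte is decoded independently of its neighbours.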

"Undefined" bytes in this context do not equate to "invalid" bytes; they were more like don't-cares. Suppose they were invalid per se: then ISO/IEC 8859-1 would not allow a newline or a tab either, since those are not defined in ISO/IEC 8859-1 itself but are part of the ISO/IEC 6429 C0 control codes. But a character set without a newline sounds... absurd?

It should be pointed out that the historical model of character sets was quite different from today's. First, a recap:

* A (coded) character set is a partial function from an integer to a defined character meaning.

* A character encoding is a total function from a stream of bytes to either a stream of characters or an error.
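
A rough Python sketch of the two definitions, purely for illustration: `CHARSET_8859_1` and `decode_strict` are hypothetical names, and this is not how any real codec behaves. It models the character set as a dict (a partial function) and a deliberately strict encoding built on it (a total function that returns either text or an error):

```python
# Partial function: only the graphic characters that ISO/IEC 8859-1 itself
# assigns (0x20-0x7E and 0xA0-0xFF). Other integers simply have no entry.
CHARSET_8859_1 = {b: chr(b) for b in (*range(0x20, 0x7F), *range(0xA0, 0x100))}


def decode_strict(data: bytes) -> str:
    """Total function: every input yields either a string or an error."""
    chars = []
    for b in data:
        if b not in CHARSET_8859_1:
            raise ValueError(f"byte 0x{b:02X} has no assigned character")
        chars.append(CHARSET_8859_1[b])
    return "".join(chars)


assert decode_strict(b"caf\xe9") == "café"

# A newline is outside the character set, so this strict reading rejects it.
try:
    decode_strict(b"line one\nline two")
    newline_rejected = False
except ValueError:
    newline_rejected = True
assert newline_rejected
```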

ISO/IEC 8859-1 is a coded character set, but not a character encoding. It was possible to treat character sets as character encodings, and in fact this separation became apparent only after the rise of Unicode. But as you can see, 8859-1 does not have a newline, so something else had to provide one. Thus there were "adapter" character encodings that make use of the desired character sets, most prominently ISO/IEC 2022 and ISO/IEC 4873. In most practical implementations of both, 6429 is a default building block, so as a character encoding 8859-1 contains 6429, although 8859-1 itself was not really a proper encoding.

One more point: 2022 and 4873 were not the only character encodings available at that time. One may simply define a character encoding by turning a partial function into a total function, or by defining a total function from the beginning, and that's what IANA did [1]. IANA's version of 8859-1 ("ISO-8859-1") [2] is a proper character encoding with all control codes defined. And I believe the alias "latin1" actually came from this registration!
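
As far as I can tell, Python's built-in codec follows this totalized version. A small sketch to check that, assuming only the standard `codecs` and `unicodedata` modules:

```python
import codecs
import unicodedata

# Python resolves "latin1" and "ISO-8859-1" to the same codec.
assert codecs.lookup("latin1").name == codecs.lookup("ISO-8859-1").name

# Bytes 0x00-0x1F and 0x80-0x9F decode to the C0 and C1 control code points
# instead of raising an error.
control_bytes = bytes([*range(0x00, 0x20), *range(0x80, 0xA0)])
controls = control_bytes.decode("latin1")

# Unicode classifies all of them as control characters (category "Cc").
categories = {unicodedata.category(ch) for ch in controls}
assert categories == {"Cc"}
```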

[1] https://www.iana.org/assignments/character-sets/character-se...

[2] https://tools.ietf.org/html/rfc1345#page-63