by garethrees 2810 days ago
There is a particular use case which leads to frustration with Python 3, if you don't know the latin1 trick.

The use case is when you have to deal with files that are encoded in some unknown ASCII-compatible encoding. That is, you know that bytes with values 0–127 are compatible with ASCII, but you know nothing whatsoever about bytes with values 128–255.

The use case arises when you have files produced by legacy software and you don't know what the encoding is. You want to process the embedded ASCII-compatible parts of the file as if they were text, but pass the other parts (which you don't understand) through unchanged. For example, the files are documents in some markup language, and you want to make automatic edits to the markup but leave the rest of the text unchanged. Processing as text requires you to decode it, but you can't decode as 'ascii' because there are high-bit-set characters too.

The trick is to decode as latin1 on input, process the ASCII-compatible text, and encode as latin1 on output. The latin1 character set has a code point for every byte value, and bytes with the high bit set will pass through unchanged. So even if the file was actually utf-8 (say), it still works to decode and encode it as latin1, and multi-byte characters will survive this process.
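A minimal sketch of the trick in Python 3, assuming the automatic edit is a simple tag rename. The function name `rename_tag` and the sample bytes are made up for illustration and are not from the original post.

```python
import re

def rename_tag(data: bytes, old: str, new: str) -> bytes:
    """Rename <old> and </old> tags to <new>, leaving all other bytes untouched."""
    # latin1 maps each byte 0-255 to the code point with the same value,
    # so this decode cannot fail, whatever the real encoding is.
    text = data.decode("latin1")
    # The pattern and the replacement contain ASCII characters only.
    pattern = re.compile("<(/?)" + re.escape(old) + ">")
    text = pattern.sub(lambda m: "<" + m.group(1) + new + ">", text)
    # Encoding as latin1 turns each code point back into the same byte.
    return text.encode("latin1")

# Hypothetical input: ASCII markup around bytes in an unknown encoding.
# (They happen to be UTF-8 here, but the code never needs to know that.)
sample = b"<b>caf\xc3\xa9</b>"
result = rename_tag(sample, "b", "strong")
print(result)
```

The high-bit bytes `\xc3\xa9` come out exactly as they went in; only the ASCII markup changes.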

The latin1 trick deserves to be better known, perhaps even a mention in the porting guide.

9 comments

> The latin1 character set has a code point for every byte value

No it doesn’t. The whole range of 128-159 are undefined. However, the old MS-DOS CP-437 encoding (which is incompatible with latin1/ISO-8859-1) does. So your trick is valid, but not with latin1.

I can’t edit my post now, but it turns out I was wrong. The range 128-159 is defined in ISO-8859-1, as little-used “control characters”:

https://en.wikipedia.org/wiki/C0_and_C1_control_codes#C1_set

So, the trick described by garethrees does work with latin1, and I was mistaken.

> No it doesn’t. The whole range of 128-159 are undefined.

Not in the sense of being decodable. If you decode a byte string with Latin-1, you get a unicode string containing code points 0-255 only, each code point matching exactly the numerical value of the corresponding byte in the byte string. So you can recover exactly the original byte string by re-encoding. Plus, every possible byte string is valid for decoding in Latin-1, so you will never get any decode/encode errors. As long as you don't care about the semantic meaning of bytes 128-255, this allows you to preserve the data while still working with Unicode strings.
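The round-trip property described above can be checked directly in Python 3. This is only a quick sanity check, not a proof about any particular file:

```python
# Every possible byte value, 0 through 255.
all_bytes = bytes(range(256))

# Decoding never raises: latin1 accepts every byte.
text = all_bytes.decode("latin1")

# Each resulting code point equals the numeric value of its source byte.
assert [ord(ch) for ch in text] == list(all_bytes)

# Re-encoding recovers the original byte string exactly.
assert text.encode("latin1") == all_bytes
```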

"Undefined" bytes in this context does not equate to "invalid" bytes; they were more like don't-cares. Suppose they were invalid per se: then ISO/IEC 8859-1 would not allow a newline or a tab either, since those are not defined in ISO/IEC 8859-1 but are part of the ISO/IEC 6429 C0 control codes. But a character set without a newline sounds... absurd?

It should be pointed out that the historical model of character sets is quite different from today's. First, a recap:

* A (coded) character set is a partial function from an integer to a defined character meaning.

* A character encoding is a total function from a stream of bytes to either a stream of characters or an error.

ISO/IEC 8859-1 is a coded character set, but not a character encoding. It was possible to treat character sets as character encodings, and in fact this separation became apparent only after the rise of Unicode. But as you see, 8859-1 does not have a newline, so something else has to provide it. Thus there were "adapter" character encodings that make use of the desired character sets: most prominently ISO/IEC 2022 and ISO/IEC 4873. In most practical implementations of both, 6429 is a default building block, so as a character encoding 8859-1 contains 6429, although 8859-1 itself was not really a proper encoding.

One more point: 2022 and 4873 were not the only character encodings available at that time. One may simply define a character encoding by turning a partial function into a total function, or by defining a total function from the beginning, and that's what IANA did [1]. IANA's version of 8859-1 ("ISO-8859-1") [2] is a proper character encoding with all control codes defined. And I believe the alias "latin1" actually came from this registration!

[1] https://www.iana.org/assignments/character-sets/character-se...

[2] https://tools.ietf.org/html/rfc1345#page-63

This should definitely be documented somewhere. I (and probably many others) figured this out the hard way by trial-and-erroring through the encodings list.

(For future search engine users, this was for PGN files from Kingbase using the python-chess library)

Would not a better solution be to process the file as a byte string?

I don't think so. If you want to detect and operate on only the data that could represent ASCII characters, you could certainly process it as a byte string if you wanted, but you'd have to track the presence of non-ASCII-range character codes yourself, and keep state around to represent whether you were in the middle of a multibyte character as you read through the bytes.

If done right, it would be a (probably much slower) re-implementation of what happens when you use the latin1 trick mentioned. You have to get it right, though (sneaky edge cases abound--what if the file starts in the middle of an incomplete multibyte character?).

TL;DR this could technically work but is a poor idea.

This is talking about the case where you don't know the encoding. So you don't know which byte sequences are multibyte characters. Whether you use latin1 or bytes the edge cases are exactly the same, and they don't get handled.
You wouldn't be able to use any APIs that only work on string (unicode) type objects.
I ran into this when trying to process .srt (subtitles) files. The timestamp information is encoded in ASCII, and the actual subtitle text you would like to pass through unaltered. (In my case, I was just adjusting the timestamps).
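A rough sketch of that kind of timestamp adjustment using the latin1 trick, assuming the usual `HH:MM:SS,mmm` form. The helper name `shift_srt` and the sample data are invented for illustration. As a simplification, it shifts anything that looks like a timestamp, including one that appears inside subtitle text.

```python
import re

# [0-9] rather than \d, so only ASCII digits can match.
TIMESTAMP = re.compile(r"([0-9]{2}):([0-9]{2}):([0-9]{2}),([0-9]{3})")

def shift_srt(data: bytes, offset_ms: int) -> bytes:
    """Shift every HH:MM:SS,mmm timestamp by offset_ms; other bytes pass through."""
    def shift(match):
        h, m, s, ms = (int(g) for g in match.groups())
        # Clamp at zero so a negative offset cannot produce a negative time.
        total = max(0, ((h * 60 + m) * 60 + s) * 1000 + ms + offset_ms)
        s, ms = divmod(total, 1000)
        m, s = divmod(s, 60)
        h, m = divmod(m, 60)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    text = data.decode("latin1")
    return TIMESTAMP.sub(shift, text).encode("latin1")

# Hypothetical subtitle entry; the text bytes are in some unknown 8-bit encoding.
sample = b"1\r\n00:00:01,000 --> 00:00:04,500\r\nna\xefve caf\xe9\r\n"
shifted = shift_srt(sample, 1500)
print(shifted)
```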
Isn't the correct practice to use errors="surrogateescape" for precisely this purpose with any encoding? So in this case, you would use .decode("ascii", errors="surrogateescape"), as the first 128 byte values are the only ones you are sure of, and then .encode("ascii", errors="surrogateescape") to save again.
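For comparison, a small sketch of the surrogateescape variant in Python 3. The sample bytes and the tag edit are made up for illustration.

```python
data = b"<b>caf\xc3\xa9</b>"

# Bytes >= 0x80 are not valid ASCII. With surrogateescape, each one is mapped
# to a lone surrogate in U+DC80..U+DCFF instead of raising UnicodeDecodeError.
text = data.decode("ascii", errors="surrogateescape")

# An ASCII-only edit.
text = text.replace("<b>", "<strong>").replace("</b>", "</strong>")

# Encoding with the same handler turns the surrogates back into the original bytes.
out = text.encode("ascii", errors="surrogateescape")
print(out)
```

One practical difference from latin1: the unknown bytes show up as lone surrogates rather than as plausible-looking Latin-1 letters, so accidentally encoding the string as UTF-8 without the handler raises an error instead of silently writing different bytes.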
> perhaps even a mention in the porting guide

It is in at least one porting guide:

http://www.catb.org/esr/faqs/practical-python-porting/

Not sure I follow. How does the latin1 trick handle multibyte characters?
_If_ there aren’t any multibyte characters that contain bytes that could be ASCII characters, the “process the ASCII-compatible text” step doesn't change any multibyte characters, so they round-trip.

Of course, this will break down if multi-byte characters can contain byte values that could be ASCII. It can break HTML or TeX, for example.

If you're looking at legacy 8-bit encodings, you'll be ok, most (all?) of those have ASCII as the first 128, or if not (EBCDIC), you're pretty screwed anyway. For utf-8 you're ok too -- all of the multibyte sequences have the high bit set. For ucs-2 or utf-16, you're likely to screw things up.
It doesn't - the parent says this is for (unknown, but) ascii-compatible encodings - old fashioned codepages.
And then gives the example of utf-8.
UTF-8 is ascii-compatible. Everything with the high bit cleared (code points 0x00-0x7F) is represented identically to ASCII. All codepoints >= 0x80 are represented with multiple bytes, each with the high bit (0x80) set.

UTF-8 is a very elegant construct for Unix-type C systems — you could basically reuse all your nul-terminated string APIs.

Sure. But it’s not at all clear to me that this trick would actually handle multibyte utf-8 chars correctly.
Consider the codepoint U+1F4A9 ("PILE OF POO").

This encodes to the byte sequence F0 9F 92 A9 in UTF-8. Notice that every one of these bytes has a value > 0x7F, which means they're all outside the ASCII range.

That's one of the useful properties of UTF-8: you know that a code point requiring multi-byte encoding will never contain any bytes that could be confused for ASCII, because every byte of a multi-byte code point will be > 0x7F.

Which in turn means that if you use any processing mechanism that only alters bytes which are in the ASCII range, and passes all other bytes through unmodified, you are guaranteed not to modify or corrupt any multi-byte UTF-8 sequences.
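This can be checked in Python 3 with the code point from the example above. The surrounding `<p>` markup is invented for illustration.

```python
poo = "\U0001F4A9"  # PILE OF POO
encoded = poo.encode("utf-8")

# Four bytes, every one of them above the ASCII range.
assert encoded == b"\xf0\x9f\x92\xa9"
assert all(byte > 0x7F for byte in encoded)

# An ASCII-only edit made through latin1 leaves the sequence intact.
data = ("<p>" + poo + "</p>").encode("utf-8")
text = data.decode("latin1")
text = text.replace("<p>", "<div>").replace("</p>", "</div>")
out = text.encode("latin1")

# The result is still valid UTF-8 containing the same character.
assert out.decode("utf-8") == "<div>" + poo + "</div>"
```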

Can you provide example code of the latin1 trick?
This advice is very dangerous: both Shift JIS and UTF-16 (some of the most common non-UTF-8 encodings) can contain bytes that are real 0-127 ASCII codepoints, as well as bytes that look like 0-127 ASCII but are in the second part of a multi-byte sequence and do not represent ASCII-equivalent characters at all.
Note that they said "ASCII-compatible encoding". You're right to note the problem with shift-JIS and others, but then, those aren't ASCII-compatible. Whereas utf8 and the iso-8859 series are all ASCII-compatible in that if it looks like an ASCII character it is.
The point is, certain text, especially Shift JIS and the various EUC encodings, can look exactly like an 8-bit "extended ASCII" when it's in fact a variable-width 8-16 bit encoding.

It's bad advice that leads to corruption.

If you already know the encoding, then OP's advice is useless. If you don't, but suspect it's an 8-bit extended ASCII encoding, it might not be, because the aforementioned encodings look exactly like an 8-bit encoding.
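A small Python 3 illustration of the Shift JIS failure mode, assuming an edit that rewrites backslashes. The katakana character and the edit are chosen for illustration only.

```python
# Katakana SO (U+30BD) is the two bytes 0x83 0x5C in Shift JIS.
# The second byte has the same value as an ASCII backslash.
data = "\u30bd".encode("shift_jis")
assert data == b"\x83\x5c"

# An ASCII-only edit that turns backslashes into forward slashes...
text = data.decode("latin1")
edited = text.replace("\\", "/").encode("latin1")

# ...rewrites the trailing byte of the two-byte character.
assert edited == b"\x83\x2f"
assert edited != data
```

The high-bit byte passes through unchanged, as promised, but the character is still corrupted because its second byte falls in the 0-127 range.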