|
There is a particular use case which leads to frustration with Python 3, if you don't know the latin1 trick. The use case is when you have to deal with files that are encoded in some unknown ASCII-compatible encoding. That is, you know that bytes with values 0–127 are compatible with ASCII, but you know nothing whatsoever about bytes with values 128–255. The use case arises when you have files produced by legacy software where you don't know what the encoding is, but you want to process embedded ASCII-compatible parts of the file as if they were text, but pass the other parts (which you don't understand) through unchanged (for example, the files are documents in some markup language, and you want to make automatic edits to the markup but leave the rest of the text unchanged). Processing as text requires you to decode it, but you can't decode as 'ascii' because there are high-bit-set characters too. The trick is to decode as latin1 on input, process the ASCII-compatible text, and encode as latin1 on output. The latin1 character set has a code point for every byte value, and bytes with the high bit set will pass through unchanged. So even if the file was actually utf-8 (say), it still works to decode and encode it as latin1, and multi-byte characters will survive this process. The latin1 trick deserves to be better known, perhaps even a mention in the porting guide. |
No it doesn’t. The whole range of 128-159 are undefined. However, the old MS-DOS CP-437 encoding (which is incompatible with latin1/ISO-8859-1) does. So your trick is valid, but not with latin1.