by kevin_thibedeau
2426 days ago
Your proposal only works well for US-ASCII users. What if I want to manage multiple ISO-8859 encodings in conjunction with 7-bit ASCII? Maybe I also have some EUC-JP multi-byte text to deal with. It becomes an intractable mess without explicit encoding management. Someone will absolutely end up misinterpreting encoded text as bytes and cause all manner of compatibility and security issues. Having a Unicode string type forces this to be dealt with even if it is inconvenient when taking in data from outside the Python environment.
|
No, and I explicitly mentioned UTF-8. My suggestion is that str holds arbitrary immutable binary data and that you have a method which can interrogate whether that binary data is valid UTF-8.
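To make that concrete, here is a rough sketch of the idea in Python. `ByteStr` and `is_valid_utf8` are hypothetical names made up for illustration, not an existing type or method; the only real API used is `bytes.decode`, which raises `UnicodeDecodeError` on malformed input.

```python
class ByteStr:
    """Hypothetical string type that just holds raw bytes (illustration only)."""

    __slots__ = ("_data",)

    def __init__(self, data: bytes) -> None:
        # Copy into an immutable bytes object; no encoding is assumed here.
        self._data = bytes(data)

    def __bytes__(self) -> bytes:
        return self._data

    def is_valid_utf8(self) -> bool:
        # Ask whether the payload happens to be well-formed UTF-8,
        # using Python's strict decoder as the validator.
        try:
            self._data.decode("utf-8")
        except UnicodeDecodeError:
            return False
        return True


utf8_text = ByteStr("héllo".encode("utf-8"))
latin1_text = ByteStr("héllo".encode("latin-1"))  # 0xE9 is not valid UTF-8 here
```

The point is that the container never rejects data; callers that care about text ask the question explicitly.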
Yes, real-world text is messy and there are lots of encodings, compression schemes, and exceptions (UTF-8 with byte order marks, overlong encodings, or surrogate pairs, as examples). If your main task is converting text between outdated or broken encodings, I don't have any problem saying you need a separate library and shouldn't burden the rest of the user base. Despite its flaws, Unicode with a UTF-8 encoding is what the majority of the world has settled on.
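For reference, this is how CPython 3's own strict UTF-8 decoder treats the three exceptions mentioned above. The helper name `is_valid_utf8` is made up for this example; the byte sequences are standard test cases.

```python
def is_valid_utf8(data: bytes) -> bool:
    # Hypothetical helper: True if CPython's strict decoder accepts the bytes.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


bom_prefixed = b"\xef\xbb\xbfhi"      # UTF-8 byte order mark followed by "hi"
overlong_slash = b"\xc0\xaf"          # overlong two-byte encoding of "/"
encoded_surrogate = b"\xed\xa0\x80"   # UTF-8-style encoding of surrogate U+D800

# The BOM is well-formed UTF-8; "utf-8" keeps it as U+FEFF, "utf-8-sig" strips it.
bom_kept = bom_prefixed.decode("utf-8")
bom_stripped = bom_prefixed.decode("utf-8-sig")
```

The BOM case passes validation, while the overlong and surrogate sequences are rejected outright, so a plain validity check already covers the last two.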
"Special cases aren't special enough to break the rules."