|
|
|
|
|
by int_19h
3403 days ago
|
|
I'm not at all opposed to UTF-8 as internal encoding for strings. But that's completely different and orthogonal to what you're talking about, which is whether strings and byte arrays should be one and the same thing semantically, and represented by the same type, or by compatible types that are used interchangeably. I want my strings to be strings - that is, an object for which it is guaranteed that enumerating codepoints always succeeds (and, consequently, any operation that requires valid codepoints, like e.g. case-insensitive comparison, also succeeds). This is not the case for "just assume a bytearray is UTF-8 if used in string context", which is why I consider this model broken on the abstraction level. It's kinda similar to languages that don't distinguish between strings and numbers, and interpret any numeric-looking string as number in the appropriate context. FWIW, the string implementation in Python 3 supports UTF-8 representation. And it could probably be changed to use that as the primary canonical representation, only generating other ones if and when they're requested. |
|
Thus Go's strings don't potentially fail your desire for a guarantee anymore than anything else would assuming UTF8. They're unicode-by-default, which was the whole point to Python3 but Go has it too in a more elegant way. That's the beauty to UTF8 by default, you can pass it valid UTF8 or ASCII since it's a subset, the context of which it's being received is up to you. If you're expecting bytes it works, if you're expecting unicode codepoints that works. There's no reason to get your hands dirty with encodings unless you need to decode UTF16 etc first. If there is still a concern about data validation, that's up to you not your string type to throw an exception.