|
It's not at all obvious how it helps, but it does. First, why is Python unable to represent invalid path names as strings? Because internally it converts strings from UTF-8, UTF-16, or any other encoding to a fixed-width array of decoded Unicode code points. The width of the integer used to represent code points is determined by the largest code point in the string: if every code point fits in one byte (ASCII or Latin-1), it can use a byte (uint8) per character; if the string goes beyond that but is all BMP, it can use a uint16 per character; otherwise it has to use a uint32 per character. Why does Python do all this? So that you can have O(1) character indexing. If you gave up on that, you wouldn't need to convert the string at all; you could just leave it as (potentially invalid) UTF-8 data.
Suppose you get an invalid path on UNIX, where paths are UTF-8 by convention. What does Python do with this string? It can't convert it to an array of code points because invalid UTF-8 doesn't correspond to a code point (well, it can if it's just illegal, not malformed, but in general we have to consider completely malformed strings that don't even follow the basic UTF-8 format). So Python is stuck: it can only replace the invalid data with something like the Unicode replacement character. But then you can't do anything useful with the result, because it's no longer the correct name of the path you're trying to work with.
How does using UTF-8 to represent strings help? Because you can represent invalid strings: just leave them as-is and don't try to decode them unless you have to. Sure, you can't decode them as code points, but that's actually a pretty unusual thing to do. If someone asks for decoding, _then_ you can give an error. What about Windows, where paths are UTF-16 by convention? You can convert them to WTF-8 and everything works out. (Described in way more detail here: https://news.ycombinator.com/item?id=33984308). |
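To make the decoding step concrete, here is a minimal Python sketch of what happens when a byte string that is not valid UTF-8 is turned into a `str`. The file name `report-\xff.txt` is made up for illustration. The first two cases are the ones described above (an error, or a lossy replacement). The third uses the `surrogateescape` error handler from PEP 383, which CPython provides to smuggle undecodable bytes through a `str` as lone surrogates, so it is worth keeping in mind when weighing the argument.

```python
# Bytes of a hypothetical file name; 0xFF never appears in well-formed UTF-8.
raw = b"report-\xff.txt"

# 1. Strict decoding refuses the name outright.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# 2. Replacing the bad byte with U+FFFD yields a string, but encoding it
#    again no longer gives the original name.
replaced = raw.decode("utf-8", errors="replace")
lossy_roundtrip = replaced.encode("utf-8")

# 3. "surrogateescape" maps each undecodable byte to a lone surrogate
#    (0xFF -> U+DCFF), so the original bytes can be recovered on encode.
escaped = raw.decode("utf-8", errors="surrogateescape")
exact_roundtrip = escaped.encode("utf-8", errors="surrogateescape")

print(strict_ok)
print(lossy_roundtrip == raw)
print(exact_roundtrip == raw)
```

Note that the string produced in case 3 is not valid Unicode text either: encoding it with the default strict handler raises `UnicodeEncodeError`, so the "invalid data" problem is deferred rather than removed.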
That’s not UTF-8. That’s a bag o’ bytes which might be UTF-8. Very different thing.
> Sure, you can't decode them as code points, but that's actually a pretty unusual thing to do.
It’s not; any Unicode-aware text processing does it implicitly. This means any such processing has to either validate the input itself, or it may fly off the rails entirely if fed nonsense. This also increases the risk of security issues, either outright UB or the ability to smuggle payloads through overlong encodings.
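As an illustration of the overlong-encoding point, here is a small Python sketch. `naive_decode_2byte` is a hypothetical helper written for this example, not part of any library; it applies the two-byte UTF-8 bit layout without checking anything, which is roughly what unvalidated processing ends up doing.

```python
def naive_decode_2byte(seq: bytes) -> str:
    """Hypothetical decoder: applies the 2-byte UTF-8 bit layout, no validation."""
    lead, cont = seq
    return chr(((lead & 0x1F) << 6) | (cont & 0x3F))

# Overlong two-byte encoding of "/" (U+002F), which well-formed UTF-8 forbids.
overlong_slash = b"\xc0\xaf"

# A byte-level filter looking for a literal slash does not see one...
filter_sees_slash = b"/" in overlong_slash

# ...but the unvalidated decoder happily produces one.
naive_result = naive_decode_2byte(overlong_slash)

# A validating decoder rejects the sequence instead.
try:
    overlong_slash.decode("utf-8")
    strict_accepts = True
except UnicodeDecodeError:
    strict_accepts = False

print(filter_sees_slash, repr(naive_result), strict_accepts)
```

The mismatch between what the filter sees and what the decoder produces is the smuggling risk: each layer that handles the bytes has to agree on whether validation has already happened.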