by masklinn
1282 days ago
> How does using UTF-8 to represent strings help? Because you can represent invalid strings: just leave them as-is and don't try to decode them unless you have to.
That’s not UTF-8. That’s a bag o' bytes which might be UTF-8. Very different thing.
> Sure, you can't decode them as code points, but that's actually a pretty unusual thing to do.
It’s not, any unicode-aware text processing does it implicitly. This means any such processing has to either perform its own validation that the input is valid, or it may fly off the rails entirely if fed nonsense. This also increases the risk of security issues, either outright UB or the ability to smuggle payloads through overlong encodings.
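To make the overlong-encoding concern concrete, here is a minimal Python sketch. The function `naive_decode_2byte` is a hypothetical decoder written only for illustration (it is not any library's API); it skips the range check that a conforming decoder performs. The bytes `C0 AF` are a well-known overlong, and therefore invalid, encoding of `/`.

```python
# Hypothetical naive decoder for a 2-byte sequence (110xxxxx 10yyyyyy).
# It deliberately omits the overlong check a conforming decoder must do.
def naive_decode_2byte(b: bytes) -> str:
    return chr(((b[0] & 0x1F) << 6) | (b[1] & 0x3F))


def strict_decode_ok(b: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        b.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


overlong_slash = b"\xc0\xaf"  # overlong (invalid) encoding of U+002F '/'

# A byte-level filter looking for '/' does not see one...
filter_sees_slash = b"/" in overlong_slash

# ...but the naive decoder downstream happily produces one.
naive_result = naive_decode_2byte(overlong_slash)

# A validating decoder rejects the input outright.
strict_ok = strict_decode_ok(overlong_slash)
```

The gap between what the byte-level filter sees and what the naive decoder produces is the smuggling risk described above; validating once at the boundary closes it.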
True; I was careful not to call it that, but treating strings as UTF-8 by convention does make sense.
> It’s not, any unicode-aware text processing does it implicitly. This means any such processing has to either perform its own validation that the input is valid, or it may fly off the rails entirely if fed nonsense.
In theory, but that's just not how most string operations actually work. If you have two UTF-8 strings and you want to concatenate them, you just concatenate the bytes. It would be ridiculously inefficient to decode the code points in each string and then re-encode them back into a destination buffer. If you have two UTF-8 strings and you want to see if one is a substring of the other and at what byte index, you just look for the bytes of one as a "substring" of the bytes of the other. Again, it would be ridiculously inefficient to decode the code points in each and do matching on code points. But what if the strings aren't valid UTF-8?! Both of those operations work just fine even if the strings aren't valid and produce sensible, intuitive results.
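A small Python sketch of the two operations described above, using `bytes` as a stand-in for "UTF-8 by convention" strings. The sample values are made up for illustration; the point is only that concatenation and substring search never decode anything, so they behave the same whether or not the input is valid.

```python
valid = b"caf\xc3\xa9 "       # valid UTF-8 for "café "
invalid = b"\xff\xfe broken"  # not valid UTF-8: 0xFF never occurs in UTF-8

# Concatenation is just byte concatenation; nothing is decoded.
joined = valid + invalid

# Substring search is a byte search and returns a byte index.
idx_valid = joined.find("é".encode("utf-8"))  # needle is valid UTF-8
idx_invalid = joined.find(b"\xfe bro")        # needle is not valid UTF-8

# Decoding only fails at the point where something actually asks for it.
try:
    joined.decode("utf-8")
    decodes = True
except UnicodeDecodeError:
    decodes = False
```

Here `idx_valid` is 3 and `idx_invalid` is 7, both byte offsets into `joined`, and `decodes` is `False` because the combined buffer is not valid UTF-8 even though both operations succeeded.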
If you're implementing a browser or a terminal that has to actually display UTF-8 as characters, then sure, you have to actually decode characters. Similarly, if you're parsing text somehow, then you have to decode characters. But many programs only do concatenation, search, and other operations like that, which are actually implemented in terms of byte sequences, not characters.