by dcode
1303 days ago
In more technical jargon, it is about value spaces. All UTF-8 strings map to WTF-16 strings semantically (lists of Unicode Scalar Values are a subset of lists of Unicode Code Points), but some WTF-16 strings do not map to UTF-8 (Unicode Scalar Values exclude some Code Points, namely the surrogates). That's something UTF-8-based languages have to deal with anyway (on the Web, which is WTF-16, or in any other mixed system), but it's odd that the expectation has become that even languages that map to each other, including to JS/Web APIs, now have to share a problem that does not exist in their native VMs, for reasons. To emphasize: these languages just do what's perfectly fine in WebIDL, JSON, ECMAScript and their own language specifications.

  let myString = "...";
  let returnedMyString = someFunction(myString); // might or might not cross boundary
  if (myString == returnedMyString) {
    // sometimes silently false
  }
I have a hard time understanding how so many people prefer this as the outcome of a Web / polyglot / language-neutral standard, especially since affected languages cannot change, and the problem is so trivially avoided, say with a boolean flag "this end is WTF-16, it's OK if the other end is as well" (otherwise use well-formed/UTF-8 semantics). For context, I once gave a presentation about the pitfalls: https://www.youtube.com/watch?v=Ri2NMnSQo4o
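To make the "sometimes silently false" case concrete, here is a minimal TypeScript sketch. It assumes the boundary behaves like a UTF-8 encode followed by a decode, which is only a stand-in for whatever the real boundary does; `roundTripUtf8` and `isWellFormed` are hypothetical helper names, not part of any proposal discussed here. It uses the standard `TextEncoder` and `TextDecoder` globals, which replace a lone surrogate with U+FFFD.

```typescript
// Hypothetical helper: true if the string has no lone surrogates, i.e. it is
// a list of Unicode Scalar Values and therefore has a UTF-8 encoding.
function isWellFormed(s: string): boolean {
  for (let i = 0; i < s.length; i++) {
    const c = s.charCodeAt(i);
    if (c >= 0xd800 && c <= 0xdbff) {
      // A high surrogate must be followed by a low surrogate.
      const next = i + 1 < s.length ? s.charCodeAt(i + 1) : 0;
      if (next < 0xdc00 || next > 0xdfff) return false;
      i++; // skip the low surrogate of the valid pair
    } else if (c >= 0xdc00 && c <= 0xdfff) {
      return false; // low surrogate without a preceding high surrogate
    }
  }
  return true;
}

// Assumed stand-in for a boundary with UTF-8 semantics: encode, then decode.
function roundTripUtf8(s: string): string {
  return new TextDecoder().decode(new TextEncoder().encode(s));
}

const paired = "a\uD83D\uDE00b"; // valid surrogate pair (U+1F600)
const lone = "a\uD83Db";         // lone high surrogate, valid only in WTF-16

console.log(isWellFormed(paired), roundTripUtf8(paired) === paired); // true true
console.log(isWellFormed(lone), roundTripUtf8(lone) === lone);       // false false
```

The lone surrogate comes back as U+FFFD, so the comparison fails without any error being raised, while the well-formed string survives unchanged.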
|
It's because the idea that languages "cannot change" does not appear to be true. UTF-8 is so widespread now that changing the native string representation to it has become an interesting proposition for languages. Many modern languages (e.g. Go and Rust) picked UTF-8 from the start, and others such as Swift switched over to it. Then there are language implementations like PyPy (for Python) that changed their internal encoding despite a widespread assumption that this could not work.
The web is also not WTF-16; JavaScript is, and the web consists of more than just JavaScript. WTF-16 to WTF-16 will most likely become less and less of a thing going forward, except for legacy interfaces such as the W APIs on Windows, and even there UTF-8 at the code page level now appears to be strongly recommended.
To give you another example: I'm very interested in using AssemblyScript today to do data processing, but that is actually not all that easy because the data I need to process is in UTF-8. To use the string class in AssemblyScript, I have to do a pointless data conversion to WTF-16 and back.
I would be majorly surprised if JavaScript doesn't adopt UTF-8 at some point as well.