by dcode 1385 days ago
Except that picking the most restrictive option prevents two components that use the same less restrictive semantics from communicating with each other securely (without throwing exceptions or silently mutating data), e.g. Java <-> Java or AssemblyScript <-> JavaScript. To illustrate: if one were to design a Component Model for Java alone, restricting it to UTF-8 would make that Component Model a hazard for Java itself. The same effect occurs in a multi-language Component Model, where some languages then work and others don't. Hence "pick the best option" falls short. The argument in all these questionably stonewalled discussions is basically to allow these languages and use cases to exist. That could be as trivial as making UTF-8 the default, if the WASI folks so wish, while also providing a Boolean flag for "don't eagerly mutate on WTF-16 pass-through". Yet, even though this is trivial and rather obvious, it has been fought relentlessly since 2017, and one surely has to wonder why there is such vehemence.

UTF-8 is not "more restrictive". I'm not sure what you're talking about.
The value spaces are in fact asymmetric. In Unicode jargon, UTF-8 is a "list of Unicode scalar values", while WTF-16, i.e. UTF-16 as seen in practice, is a "list of Unicode code points (except surrogate pairs)". Unicode scalar values are defined as "Unicode code points except surrogate code points", so code points are a strict superset of scalar values. Hence conversion from WTF-16 to UTF-8 is lossy: an unpaired surrogate has no UTF-8 representation and must be either rejected or replaced. In WTF-16 languages this is not a problem, because the system is designed for WTF-16, but it becomes one if the value space is restricted further, here to UTF-8.
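
To make the lossiness concrete, here is a minimal sketch in Python, whose `str` can hold an unpaired surrogate much like a Java or JavaScript string can. The helper names `to_utf8_strict` and `to_utf8_lossy` are made up for illustration. Python's replacement character on encode is `?`, whereas other runtimes typically substitute U+FFFD, but the effect is the same: the original value is gone.

```python
# A string with an unpaired high surrogate (U+D800), which Java and
# JavaScript strings may legally contain.
lone = "a\ud800b"


def to_utf8_strict(s: str) -> bytes:
    """Strict UTF-8 encoding: raises UnicodeEncodeError on a surrogate."""
    return s.encode("utf-8")


def to_utf8_lossy(s: str) -> bytes:
    """Lossy encoding: Python substitutes '?' for each unencodable code point."""
    return s.encode("utf-8", errors="replace")


# Option 1: the boundary throws.
try:
    to_utf8_strict(lone)
except UnicodeEncodeError as err:
    print("strict encode failed:", err.reason)

# Option 2: the boundary silently mutates the data.
mutated = to_utf8_lossy(lone).decode("utf-8")
print("lossy result:", repr(mutated), "equal to original:", mutated == lone)
```

These are the two outcomes mentioned above: an exception, or data that no longer compares equal after crossing the boundary.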

That's why such mixed systems typically use WTF-8 (https://simonsapin.github.io/wtf-8/), not UTF-8. And Wasm is such a mixed system.
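
As a rough sketch of why a WTF-8-style encoding avoids the problem, Python's `surrogatepass` error handler encodes a lone surrogate as the same three bytes WTF-8 would use, so the value survives a round trip. This is only an approximation and not a conforming WTF-8 implementation: `surrogatepass` would also encode a surrogate pair held as two separate code points as two three-byte sequences, which WTF-8 does not allow. The function names are hypothetical.

```python
lone = "a\ud800b"  # contains an unpaired high surrogate


def to_generalized_utf8(s: str) -> bytes:
    """Encode with surrogates passed through (matches WTF-8 for lone surrogates)."""
    return s.encode("utf-8", errors="surrogatepass")


def from_generalized_utf8(b: bytes) -> str:
    """Decode, accepting the three-byte surrogate sequences strict UTF-8 rejects."""
    return b.decode("utf-8", errors="surrogatepass")


encoded = to_generalized_utf8(lone)
print(encoded.hex())                            # 61eda08062
print(from_generalized_utf8(encoded) == lone)   # True
```

A strict UTF-8 decoder rejects these bytes, which is the point: the wider value space needs an encoding that is deliberately not UTF-8.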