by dcode 1384 days ago
The value spaces are in fact asymmetric. In Unicode jargon, UTF-8 is a "list of Unicode scalar values" while WTF-16, i.e. UTF-16 as seen in practice, is a "list of Unicode code points (except surrogate pairs)": lone surrogates are allowed, and only a well-formed pair is excluded, since it decodes to a single supplementary code point. Unicode scalar values are defined as "Unicode code points except surrogate code points", so code points are a strict superset of scalar values. Hence conversion from WTF-16 to UTF-8 is lossy: a lone surrogate has no UTF-8 encoding and has to be replaced (typically with U+FFFD) or rejected. In WTF-16 languages, this is not a problem because the system is designed for WTF-16, but it becomes one if the value space is restricted further, here to UTF-8.

That's why such mixed systems typically use WTF-8 (https://simonsapin.github.io/wtf-8/), not UTF-8. And Wasm is such a mixed system.
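To make the asymmetry concrete, here is a rough Python sketch (not a validated WTF-8 implementation). The helper names are made up for illustration, and it assumes that Python's `surrogatepass` error handler is a good enough stand-in for WTF-8 on these inputs: surrogate pairs are combined during UTF-16 decoding first, so only lone surrogates get the 3-byte generalized encoding.

```python
def units_to_bytes(units):
    # Pack 16-bit code units as little-endian bytes (hypothetical helper).
    return b"".join(u.to_bytes(2, "little") for u in units)

def to_utf8_lossy(units):
    # Strict UTF-8 target: lone surrogates are replaced with U+FFFD.
    text = units_to_bytes(units).decode("utf-16-le", "replace")
    return text.encode("utf-8")

def to_wtf8(units):
    # WTF-8-style target: pairs become one supplementary code point,
    # lone surrogates pass through and are encoded as 3 bytes.
    text = units_to_bytes(units).decode("utf-16-le", "surrogatepass")
    return text.encode("utf-8", "surrogatepass")

def from_wtf8(data):
    # Inverse of to_wtf8: back to a list of 16-bit code units.
    text = data.decode("utf-8", "surrogatepass")
    raw = text.encode("utf-16-le", "surrogatepass")
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]

lone_high = [0x61, 0xD800, 0x62]   # "a", lone high surrogate, "b"
lone_low = [0x61, 0xDC00, 0x62]    # "a", lone low surrogate, "b"
pair = [0xD83D, 0xDE00]            # well-formed pair for U+1F600

# Two distinct WTF-16 strings collapse to the same UTF-8 bytes.
print(to_utf8_lossy(lone_high) == to_utf8_lossy(lone_low))
# The WTF-8-style encoding keeps them apart and round-trips.
print(from_wtf8(to_wtf8(lone_high)) == lone_high)
```

Well-formed pairs come out the same either way (`to_wtf8(pair)` and `to_utf8_lossy(pair)` are both plain UTF-8), so the two encodings only differ on strings that contain lone surrogates.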