by dcode 1303 days ago
I do understand the desire to switch all languages and systems to one encoding, of course. However, switching a WTF-16 language to UTF-8 removes previously valid values from strings, exchanging what are errors/mutation on Component boundaries right now for errors/mutation when using string APIs. It can't be done in a backwards-compatible way, and all these languages have a lot of existing code. If backwards compatibility is a goal (say, when using a breadcrumbs mechanism as in Swift), one still ends up with WTF-8 underneath, which maps to WTF-16 but is not UTF-8. That's why I think it's impossible: the only way to pull this off is by replacing affected string APIs (and/or accepting that old APIs then throw or mutate). Likewise, I see a possible future where JS adopts breadcrumbs, but then with WTF-8 (and perhaps a well-formedness flag), not guaranteed UTF-8. In your use case, that would yield a fast path if a string is well-formed, but still with the same old fallback. Plus, of course, having a systems fast path implies that there is a corresponding JS-interop slow path (when using AS).
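To make the "previously valid values" point concrete, here is a minimal sketch in Python, whose `str` can also hold unpaired surrogates. The helper names `is_well_formed` and `encode_for_boundary` are hypothetical, invented for illustration; they are not part of any of the proposals discussed here. Python's `surrogatepass` error handler is used as a stand-in for WTF-8: for a lone surrogate it produces the same three bytes, though it is not a full WTF-8 implementation.

```python
from typing import Tuple


def is_well_formed(s: str) -> bool:
    """Return True if s holds no surrogate code points (strict UTF-8 can encode it)."""
    return not any(0xD800 <= ord(c) <= 0xDFFF for c in s)


def encode_for_boundary(s: str) -> Tuple[bytes, bool]:
    """Hypothetical helper: return (encoded bytes, well-formedness flag)."""
    if is_well_formed(s):
        # Fast path: the bytes are genuine UTF-8.
        return s.encode("utf-8"), True
    # Fallback: keep the lone surrogate as a WTF-8-style 3-byte sequence.
    # These bytes are NOT valid UTF-8.
    return s.encode("utf-8", "surrogatepass"), False


data, ok = encode_for_boundary("a\ud800b")  # string with an unpaired high surrogate
```

With this input the flag comes back `False`, a strict `data.decode("utf-8")` raises `UnicodeDecodeError`, and only `data.decode("utf-8", "surrogatepass")` recovers the original string. That is the fast path plus the same old fallback described above.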

PyPy uses UTF-8 internally and it’s completely hidden from the user. That’s possible, however, because Python always leaked UCS-2/UCS-4 to user code, so you could never really rely on anything.

I expect other languages to make the switch sooner or later.

I do think, though, that this is not all that interesting for the issue here. WASI needs to pick some format, and picking UTF-8 is fine. Roundtripping half-broken UTF-16 is something that does not need preserving.
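For reference, this is roughly what not preserving the roundtrip looks like, sketched in Python. The function `to_utf8_lossy` is a hypothetical name; it assumes the boundary replaces each unpaired surrogate with U+FFFD, which is one common policy (trapping is the other) and not necessarily what any given runtime does.

```python
def to_utf8_lossy(s: str) -> bytes:
    """Encode to strict UTF-8, replacing each surrogate code point with U+FFFD.

    Assumption: in a Python str every surrogate code point is unpaired,
    since str stores code points rather than UTF-16 code units.
    """
    cleaned = "".join(
        "\ufffd" if 0xD800 <= ord(c) <= 0xDFFF else c for c in s
    )
    return cleaned.encode("utf-8")


a = "key\ud800"  # ends in an unpaired high surrogate
b = "key\udc00"  # ends in an unpaired low surrogate

same_before = a == b                               # distinct strings
same_after = to_utf8_lossy(a) == to_utf8_lossy(b)  # identical bytes
```

The two inputs differ before the conversion and are indistinguishable after it. Whether that matters in practice is the disagreement in this thread.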

I think enforcing UTF-8 there won’t be much of an issue in practice.

I guess we are about to find out whether there is substance to the precedents. My bet is on "what can go wrong, will go wrong", even more so on Web scale. Let's hope I'm wrong.