Hacker News new | ask | show | jobs
by conrad-watt 1796 days ago
Full disclosure, I am an active participant in WebAssembly standardisation, my github is here (https://github.com/conrad-watt). What follows is purely my personal opinion.

This announcement is deliberately phrased to scare people who do not have sufficient context. I don't know why some AssemblyScript maintainers have decided to act in this extreme way over what is quite a niche issue. The vote that this announcement is sounding the alarm over is _not_ a vote on whether UTF-16 should be supported.

There has been a longstanding debate as part of the Wasm interface types proposal regarding whether UTF-8 should be privileged as a canonical string representation. Recently, we have moved in the direction of supporting both UTF-8 and UTF-16, although a vote to confirm this is still pending (but I personally believe would pass uncontroversially).

However, JavaScript strings are not always well-formed UTF-16 - in particular some validation is deferred for performance reasons, meaning that strings can contain invalid code points called isolated surrogates. Again, the referenced vote is _not_ a vote on whether UTF-16 should be supported, but is in fact a vote on whether we should require that invalid code points should be sanitised when strings are copied across component boundaries. Some AS maintainers have developed a strong opinion that such sanitisation would somehow be a webcompat/security hazard and have campaigned stridently against it. However sanitising strings in this way is actually a recommended security practice (https://websec.github.io/unicode-security-guide/character-tr...), so they haven't gained the traction they were hoping for with their objections.

The announcement is worded to obscure this point - talking about "JavaScript-like 16-bit string semantics" (i.e. where isolated surrogates are not sanitised) as opposed to merely "UTF-16", which forbids isolated surrogates by definition, but inviting the conflation of the two.

AS does not need to radically alter its string representation - if we were were to support UTF-16 with sanitisation, they could simply document that their potentially invalid UTF-16 strings will be sanitised when passed between components. Note that the component model is actually still being specified, so this design choice doesn't even affect any currently existing AS code. I interpret the announcement's threat of radical change as some maintainers holding AS hostage over the (again, very niche) string sanitisation issue, which is frankly pretty poor behaviour.

2 comments

The security aspect is separate from whether UTF-16 lowering and lifting is supported. One simply cannot roundtrip every possible DOMString, C#, Java etc. String through the single concept of Unicode Scalar Values without either introducing lots of surface area for a) silent data corruption (what you call "recommended security practice", i.e. strings not comparing equal anymore after what appears to be an innocent function call) or b) for (deliberate) denial of service (when erroring instead). I mean, there is a good reason why all these languages do not do that in between function calls, but actually try very hard to guarantee integrity. And in a real program, an Interface Types function looks like any other function to the developer, just imported somewhere in the codebase, so good luck documenting that. Other than that I do not know how to respond to your subtle insults, please forgive my ignorance.
For reference, this is the lead maintainer of AS.

You previously posted yourself that documenting sanitisation at the component boundary would be an acceptable solution: (https://web.archive.org/web/20210726140105if_/https://github...).

I don't understand why you have so radically changed your opinion since then.

At that time, it was ~ "you either stop advocating for the security concern, or we make sure you get nothing at all". In fact, I believe we would need to trap to at least make the breakage non-silent, but that would break AssemblyScript even more obviously and only exchange data corruption with denial of service. Not necessarily an improvement if you care about people using what you are building. In hindsight I am not proud of that comment and should have protested, no matter how dire the situation.
I don't agree with your representation that sanitisation of isolated surrogates constitutes "corruption". As a high-level point, when passing a string from your component to an external one, the external component receives a sanitised copy of your string - the original string is not modified in-place. So you still have access to your original string if you're relying on the presence of isolated surrogates for some reason.

For fairness, I will link below to your concrete example of "corruption", noting that you claim it will render Wasm "the biggest security disaster man ever created for everything that uses or opted to preserve the semantics of 16-bit Unicode".

https://github.com/w3ctag/design-principles/issues/322#issue...

I'd argue that the fundamental bug here is in splitting a string in between two code points which make up an emoji, creating isolated surrogates. This kind of mistake is common and can already cause logic and display errors in other parts of the code (e.g. for languages with non-BMP characters) independent of whether components are involved (again, I emphasise that no code using components has been written yet).

EDIT: I should also note that if it becomes necessary to transfer raw/invalid code points between components, the fallback of the `list u8` or `list u16` interface type always exists, although I acknowledge that the ergonomics may not be ideal, especially prior to adaptor functions existing.

This is common indeed, and isn't a bug in the affected source languages for reasons. How it displays when printed is irrelevant.

Here's Linus Torvalds explaining it better than I could: https://youtu.be/Pzl1B7nB9Kc?t=263

And sure you can transfer your string that someone else does not consider a string using alternative mechanisms, but then you are only not doing anything wrong because you are not doing it at all for entire categories of languages. There is no integration story for these, and once one mixes with optimizations like compact strings or has multiple encodings under the hood one cannot statically annotate the appropriate type anyhow. And sadly, adapter functions won't help as well when the fundamental 'char' type backing the 'string' type is already unable to represent your language's string.

I also do not understand where the idea that a single language always lives in a single component comes from. Certainly not from npm, NuGet, Maven or DLLs.

Extended this post to provide additional relevant context. It's not a bug, it's a feature.

I agree with the linked quote - it captures an important reason why it is valuable to _enforce_ sanitisation at component boundaries, rather than merely documenting "please don't rely on isolated surrogates being preserved across component boundaries" (which would be a problem if we didn't enforce it, since an external component you don't control may be forced to internally sanitise the string if it relies on (e.g.) an API, language runtime, or storage mechanism that admits only well-formed strings).

EDIT: since a whole other paragraph was edited in as I replied, I will respond by saying that within a component, your string can have whatever invalid representation you want. Most written code will naturally be a single component (which could even be made up of both JS and Wasm composed together through the JS API). The code may interface with other components, and this discussion is purely about what enforcement is appropriate at that boundary.

EDIT2: please consider a further reply to my post, rather than repeatedly editing your parent post in response. It is disorientating for observers. In any case, my paragraph above did not claim that there will be one component per language, but that the code _one writes oneself_ within a single language (or a collection of languages/libraries which can be tightly coupled through an API/build system) will naturally form one component.

If Interface Types doesn't convert WTF strings to UTF, the flaw won't have to be documented, and the risk that someone forgets to do the right thing and causes a program to break will be eliminated. This seems like a better outcome. I haven't been able to think of downsides of not "sanitizing". Can you list those?