Hacker News new | ask | show | jobs
by conrad-watt 1795 days ago
For reference, this is the lead maintainer of AS.

You previously posted yourself that documenting sanitisation at the component boundary would be an acceptable solution: (https://web.archive.org/web/20210726140105if_/https://github...).

I don't understand why you have so radically changed your opinion since then.

1 comments

At that time, it was ~ "you either stop advocating for the security concern, or we make sure you get nothing at all". In fact, I believe we would need to trap to at least make the breakage non-silent, but that would break AssemblyScript even more obviously and only exchange data corruption with denial of service. Not necessarily an improvement if you care about people using what you are building. In hindsight I am not proud of that comment and should have protested, no matter how dire the situation.
I don't agree with your representation that sanitisation of isolated surrogates constitutes "corruption". As a high-level point, when passing a string from your component to an external one, the external component receives a sanitised copy of your string - the original string is not modified in-place. So you still have access to your original string if you're relying on the presence of isolated surrogates for some reason.

For fairness, I will link below to your concrete example of "corruption", noting that you claim it will render Wasm "the biggest security disaster man ever created for everything that uses or opted to preserve the semantics of 16-bit Unicode".

https://github.com/w3ctag/design-principles/issues/322#issue...

I'd argue that the fundamental bug here is in splitting a string in between two code points which make up an emoji, creating isolated surrogates. This kind of mistake is common and can already cause logic and display errors in other parts of the code (e.g. for languages with non-BMP characters) independent of whether components are involved (again, I emphasise that no code using components has been written yet).

EDIT: I should also note that if it becomes necessary to transfer raw/invalid code points between components, the fallback of the `list u8` or `list u16` interface type always exists, although I acknowledge that the ergonomics may not be ideal, especially prior to adaptor functions existing.

This is common indeed, and isn't a bug in the affected source languages for reasons. How it displays when printed is irrelevant.

Here's Linus Torvalds explaining it better than I could: https://youtu.be/Pzl1B7nB9Kc?t=263

And sure you can transfer your string that someone else does not consider a string using alternative mechanisms, but then you are only not doing anything wrong because you are not doing it at all for entire categories of languages. There is no integration story for these, and once one mixes with optimizations like compact strings or has multiple encodings under the hood one cannot statically annotate the appropriate type anyhow. And sadly, adapter functions won't help as well when the fundamental 'char' type backing the 'string' type is already unable to represent your language's string.

I also do not understand where the idea that a single language always lives in a single component comes from. Certainly not from npm, NuGet, Maven or DLLs.

Extended this post to provide additional relevant context. It's not a bug, it's a feature.

I agree with the linked quote - it captures an important reason why it is valuable to _enforce_ sanitisation at component boundaries, rather than merely documenting "please don't rely on isolated surrogates being preserved across component boundaries" (which would be a problem if we didn't enforce it, since an external component you don't control may be forced to internally sanitise the string if it relies on (e.g.) an API, language runtime, or storage mechanism that admits only well-formed strings).

EDIT: since a whole other paragraph was edited in as I replied, I will respond by saying that within a component, your string can have whatever invalid representation you want. Most written code will naturally be a single component (which could even be made up of both JS and Wasm composed together through the JS API). The code may interface with other components, and this discussion is purely about what enforcement is appropriate at that boundary.

EDIT2: please consider a further reply to my post, rather than repeatedly editing your parent post in response. It is disorientating for observers. In any case, my paragraph above did not claim that there will be one component per language, but that the code _one writes oneself_ within a single language (or a collection of languages/libraries which can be tightly coupled through an API/build system) will naturally form one component.

Sure, we could resolve this problem by either a) giving these languages a separate fitting string type to use internally or externally (Rust for instance can use 'string' everywhere) or b) integrating their semantics into the single one so they are covered as well as first-class citizens. And coincidentally, that would fit JavaScript perfectly, which is rather surprising being off the table in a Web standard. Yet we are polling on having a "single" "list-of-USV" string type, likely closing the door for them forever with everything it implies.
There is no problem, assuming that one believes that the list-of-USV abstraction (i.e. sanitising strings to be valid unicode) is the right thing to enforce at the component boundary, _including_ when the internals of the component are implemented using JavaScript.

I appreciate that this is exactly the point where we currently disagree, and accept that I won't be able to convince you here. However, the AS website's announcement did not make the boundaries of the debate clear.