|
|
|
|
|
by conrad-watt
1795 days ago
|
|
I don't agree with your representation that sanitisation of isolated surrogates constitutes "corruption". As a high-level point, when passing a string from your component to an external one, the external component receives a sanitised copy of your string - the original string is not modified in-place. So you still have access to your original string if you're relying on the presence of isolated surrogates for some reason. For fairness, I will link below to your concrete example of "corruption", noting that you claim it will render Wasm "the biggest security disaster man ever created for everything that uses or opted to preserve the semantics of 16-bit Unicode". https://github.com/w3ctag/design-principles/issues/322#issue... I'd argue that the fundamental bug here is in splitting a string in between two code points which make up an emoji, creating isolated surrogates. This kind of mistake is common and can already cause logic and display errors in other parts of the code (e.g. for languages with non-BMP characters) independent of whether components are involved (again, I emphasise that no code using components has been written yet). EDIT: I should also note that if it becomes necessary to transfer raw/invalid code points between components, the fallback of the `list u8` or `list u16` interface type always exists, although I acknowledge that the ergonomics may not be ideal, especially prior to adaptor functions existing. |
|
Here's Linus Torvalds explaining it better than I could: https://youtu.be/Pzl1B7nB9Kc?t=263
And sure you can transfer your string that someone else does not consider a string using alternative mechanisms, but then you are only not doing anything wrong because you are not doing it at all for entire categories of languages. There is no integration story for these, and once one mixes with optimizations like compact strings or has multiple encodings under the hood one cannot statically annotate the appropriate type anyhow. And sadly, adapter functions won't help as well when the fundamental 'char' type backing the 'string' type is already unable to represent your language's string.
I also do not understand where the idea that a single language always lives in a single component comes from. Certainly not from npm, NuGet, Maven or DLLs.
Extended this post to provide additional relevant context. It's not a bug, it's a feature.