by dcode 1794 days ago
This is indeed common, and for good reasons it isn't a bug in the affected source languages. How the string displays when printed is irrelevant.

Here's Linus Torvalds explaining it better than I could: https://youtu.be/Pzl1B7nB9Kc?t=263

And sure, you can transfer your string, which someone else does not consider a string, using alternative mechanisms, but then the only reason you are not doing anything wrong is that you are not doing it at all for entire categories of languages. There is no integration story for these, and once one mixes in optimizations like compact strings, or has multiple encodings under the hood, one cannot statically annotate the appropriate type anyhow. And sadly, adapter functions won't help either when the fundamental 'char' type backing the 'string' type is already unable to represent your language's string.

I also do not understand where the idea that a single language always lives in a single component comes from. Certainly not from npm, NuGet, Maven or DLLs.

Extended this post to provide additional relevant context. It's not a bug, it's a feature.

1 comments

I agree with the linked quote - it captures an important reason why it is valuable to _enforce_ sanitisation at component boundaries, rather than merely documenting "please don't rely on isolated surrogates being preserved across component boundaries" (which would be a problem if we didn't enforce it, since an external component you don't control may be forced to internally sanitise the string if it relies on (e.g.) an API, language runtime, or storage mechanism that admits only well-formed strings).
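
To make "sanitisation at the boundary" concrete, here is a rough TypeScript sketch of what such a step could look like for a JS-style string of UTF-16 code units. The name `toWellFormedSketch` is made up for illustration and is not taken from any proposal or spec text; replacing isolated surrogates with U+FFFD is an assumption about the sanitisation policy, not the only possible one.

```typescript
// Hypothetical helper (not from any spec): returns a copy of `input` in which
// every isolated surrogate code unit is replaced by U+FFFD.
function toWellFormedSketch(input: string): string {
  let out = "";
  for (let i = 0; i < input.length; i++) {
    const unit = input.charCodeAt(i);
    const isHigh = unit >= 0xd800 && unit <= 0xdbff;
    const isLow = unit >= 0xdc00 && unit <= 0xdfff;
    if (isHigh && i + 1 < input.length) {
      const next = input.charCodeAt(i + 1);
      if (next >= 0xdc00 && next <= 0xdfff) {
        // A proper high+low pair: keep both units and skip past the low one.
        out += String.fromCharCode(unit, next);
        i++;
        continue;
      }
    }
    // Anything still flagged as a surrogate here is isolated.
    out += isHigh || isLow ? "\ufffd" : String.fromCharCode(unit);
  }
  return out;
}

const paired = "a\ud83d\ude00b";   // well-formed: contains a full surrogate pair
const isolated = "a\ud83d" + "z";  // ill-formed: high surrogate with no partner

console.log(toWellFormedSketch(paired) === paired);        // pair is preserved
console.log(toWellFormedSketch(isolated) === "a\ufffdz");  // isolated unit replaced
```

The point of enforcing this rather than documenting it is that a caller never observes a difference between a component that sanitises internally and one that doesn't.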

EDIT: since a whole other paragraph was edited in as I replied, I will respond by saying that within a component, your string can have whatever invalid representation you want. Most written code will naturally be a single component (which could even be made up of both JS and Wasm composed together through the JS API). The code may interface with other components, and this discussion is purely about what enforcement is appropriate at that boundary.

EDIT2: please consider a further reply to my post, rather than repeatedly editing your parent post in response. It is disorientating for observers. In any case, my paragraph above did not claim that there will be one component per language, but that the code _one writes oneself_ within a single language (or a collection of languages/libraries which can be tightly coupled through an API/build system) will naturally form one component.

> Sure, we could resolve this problem by either a) giving these languages a separate fitting string type to use internally or externally (Rust for instance can use 'string' everywhere) or b) integrating their semantics into the single one so they are covered as well as first-class citizens. And coincidentally, that would fit JavaScript perfectly, which is rather surprising being off the table in a Web standard. Yet we are polling on having a "single" "list-of-USV" string type, likely closing the door for them forever with everything it implies.

There is no problem, assuming that one believes that the list-of-USV abstraction (i.e. sanitising strings to be valid Unicode) is the right thing to enforce at the component boundary, _including_ when the internals of the component are implemented using JavaScript.
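
For observers, a small TypeScript sketch of what viewing a JS string as a list of USVs amounts to, and where the disagreement bites. `toUsvList` is a hypothetical name, and mapping isolated surrogates to U+FFFD is assumed here purely for illustration (trapping would be another option).

```typescript
// Hypothetical model: a "list-of-USV" string is an array of code points that
// never contains a value in the surrogate range 0xD800..0xDFFF.
function toUsvList(input: string): number[] {
  const usvs: number[] = [];
  for (let i = 0; i < input.length; i++) {
    const unit = input.charCodeAt(i);
    const isHigh = unit >= 0xd800 && unit <= 0xdbff;
    const isLow = unit >= 0xdc00 && unit <= 0xdfff;
    if (isHigh && i + 1 < input.length) {
      const next = input.charCodeAt(i + 1);
      if (next >= 0xdc00 && next <= 0xdfff) {
        // Combine the pair into one supplementary-plane scalar value.
        usvs.push(0x10000 + ((unit - 0xd800) << 10) + (next - 0xdc00));
        i++;
        continue;
      }
    }
    // Isolated surrogates have no USV; this sketch substitutes U+FFFD.
    usvs.push(isHigh || isLow ? 0xfffd : unit);
  }
  return usvs;
}

// Two strings that JavaScript treats as different...
const loneHigh = "\ud83d";
const loneLow = "\ude00";
console.log(loneHigh === loneLow); // false

// ...become indistinguishable once viewed as lists of USVs.
console.log(JSON.stringify(toUsvList(loneHigh)) === JSON.stringify(toUsvList(loneLow))); // true
```

Whether that loss of distinction at the boundary is acceptable is the question being polled; inside a component nothing changes.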

I appreciate that this is exactly the point where we currently disagree, and accept that I won't be able to convince you here. However, the AS website's announcement did not make the boundaries of the debate clear.