Modern string libraries largely use UTF-8 [0], and surrogates, regardless of whether they’re paired, are invalid in UTF-8. So, in a modern string library, as built in to most modern languages, you will not encounter surrogates except when translating between encodings.
[0] But everyone disagrees as to what indexing a string means, so you need to make an actual choice if you want anything involving indexing to match across languages.
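To make the footnote concrete, here is a rough JavaScript sketch of three common answers to "how long is this string". The helper name `measure` is made up for illustration, and the notes about which languages index by which unit are from memory rather than from this thread.

```javascript
// Hypothetical helper: report the size of a string in three common units.
function measure(s) {
  return {
    utf16CodeUnits: s.length,                       // what JavaScript and Java index by
    codePoints: [...s].length,                      // string iteration goes by code point
    utf8Bytes: new TextEncoder().encode(s).length,  // what byte-indexed UTF-8 strings count
  };
}

const cowboy = "\u{1F920}"; // U+1F920, outside the BMP
console.log(measure(cowboy)); // utf16CodeUnits: 2, codePoints: 1, utf8Bytes: 4
```

Any index that crosses a language boundary has to say which of these units it is in.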
> surrogates, regardless of whether they’re paired, are invalid in UTF-8
Java did not get the memo. Since the char type is fixed at 16 bits, it uses surrogates to encode everything outside the BMP, regardless of the encoding.
It depends on the language and the libraries used. E.g. in Go, the problem does not exist, because strings are UTF-8 bytes and iterating over one yields 32-bit runes, so surrogates never show up; Rust also uses UTF-8, and it makes sure that you can't slice a string between bytes that belong to the same character (it panics instead).
Fun Java/macOS quirk: macOS normalizes file names, so you can't have two files called ü in the same directory, one written as a single character and the other as a base letter plus a combining character. Unfortunately, this only happens on write, not on read. So if you type an ü on a German keyboard (which produces a single character) into a Java source file as part of a file name, the file is saved under the decomposed name, but opening it by the single-character name fails because the file is not found.
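The two spellings of ü are easy to see in JavaScript. This is only a sketch of the composed versus decomposed forms; it does not model what macOS or Java actually do, and `sameName` is a hypothetical helper.

```javascript
const composed = "\u00FC";     // ü as one code point (what a German keyboard produces)
const decomposed = "u\u0308";  // u followed by COMBINING DIAERESIS

console.log(composed === decomposed);                   // false
console.log(composed.normalize("NFD") === decomposed);  // true
console.log(decomposed.normalize("NFC") === composed);  // true

// Hypothetical helper: compare two names after bringing both to the same normal form.
function sameName(a, b) {
  return a.normalize("NFD") === b.normalize("NFD");
}

console.log(sameName(composed, decomposed)); // true
```

Normalizing both sides before comparing or looking up a name sidesteps the mismatch, assuming you control both the write and the read path.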
No, the language did not handle it fine. It allowed an invalid Unicode string to exist. This is basically a UTF-16 affliction—nothing that does UTF-16 validates, whereas almost everything that does UTF-8 does validate. encodeURIComponent deals with UTF-8, so of course it throws.
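For anyone who wants to see the throw, a small sketch follows. `tryEncode` is a hypothetical wrapper; the only real API used is `encodeURIComponent`, which raises a `URIError` on a lone surrogate.

```javascript
const lone = "\uDD20"; // a low surrogate with no preceding high surrogate

// Hypothetical helper: returns the encoded string, or null if the input is not well-formed.
function tryEncode(s) {
  try {
    return encodeURIComponent(s);
  } catch (e) {
    if (e instanceof URIError) return null;
    throw e; // anything else is unexpected
  }
}

console.log(tryEncode("\uD83E\uDD20")); // %F0%9F%A4%A0
console.log(tryEncode(lone));           // null
```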
I'm realizing `encodeURIComponent` is actually part of the ECMA spec! I thought it was something provided by the browser like `window` or `document`. I withdraw my "the language handled it fine" comment, haha.
Before I'd looked that up I was going to say: "don't allow an invalid Unicode string to exist at all" feels like a separate, bigger problem to me than "handling it fine" when one does get created. Given that I can hand JavaScript an invalid combination of code units in a variety of other scenarios, returning a � felt fine.
e.g.
// valid
String.fromCodePoint(0xd83e, 0xdd20)
// invalid, but "�" is ... fine?
String.fromCodePoint(0xdd20, 0xd83e)
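To see what those two strings turn into when converted to UTF-8, here is a sketch using `TextEncoder`, which substitutes U+FFFD for each lone surrogate instead of throwing. The `utf8` helper is just for printing and is not part of any API.

```javascript
const paired = String.fromCodePoint(0xd83e, 0xdd20);   // high then low: U+1F920
const swapped = String.fromCodePoint(0xdd20, 0xd83e);  // low then high: two lone surrogates

// Illustrative helper: UTF-8 bytes of a string as hex strings.
const utf8 = (s) => Array.from(new TextEncoder().encode(s), (b) => b.toString(16));

console.log(paired.codePointAt(0).toString(16)); // 1f920
console.log(utf8(paired));   // f0 9f a4 a0
console.log(utf8(swapped));  // ef bf bd ef bf bd  (U+FFFD twice)
```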
In Rust, an invalid Unicode string simply cannot exist (* unless you use unsafe, but all bets are off then). An important part of this is that the code unit, the scalar value and the string are three different types (u8, char, str). Iteration must decide if it wants to go by code unit or by scalar value (… or by extended grapheme cluster, but that’s not provided in std).
JavaScript’s problems start with not having separate code unit or scalar value types. Sequences of UTF-16 code units, individual UTF-16 code units and scalar values all use the type string. (Code unit and scalar value also both use number in some contexts.)
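A quick sketch of that type overlap, using nothing beyond `typeof` and the standard string methods:

```javascript
const s = "\u{1F920}";

console.log(typeof s);                 // string (sequence of code units)
console.log(typeof s[0]);              // string (a single code unit, here a lone high surrogate)
console.log(typeof [...s][0]);         // string (a scalar value, as a string)
console.log(typeof s.charCodeAt(0));   // number (a code unit, as a number)
console.log(typeof s.codePointAt(0));  // number (a scalar value, as a number)
console.log(s[0] === "\uD83E");        // true
```

Nothing in the type system distinguishes the first three lines from each other, or the fourth from the fifth.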
The first step to fixing JavaScript’s bad semantics would be separating the code unit and scalar value types. If you did that… the changes required to support strict strings are perhaps surprisingly small. Even migrating to UTF-8 semantics is not very hard then.
Unfortunately, JavaScript seems very determined to do stupid things and allow stupid things and then do more stupid things with the stupid things it foolishly allowed.