


	by agus4nas 39 days ago
	Great write-up. Do most modern languages handle invalid surrogates gracefully, or is it still a "good luck" situation depending on the runtime?

3 comments

amluto 39 days ago

Modern string libraries largely use UTF-8 [0], and surrogates, regardless of whether they’re paired, are invalid in UTF-8. So, in a modern string library, as built in to most modern languages, you will not encounter surrogates except when translating between encodings.

[0] But everyone disagrees as to what indexing a string means, so you need to make an actual choice if you want anything involving indexing to match across languages.
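To make the indexing disagreement concrete, here is a minimal JavaScript sketch that measures one string in three different units. The helper name `lengths` is made up for illustration; it only uses standard built-ins (`String.prototype.length`, string iteration, and `TextEncoder`).

```javascript
// Hypothetical helper: measure the same string in three different units.
function lengths(s) {
  return {
    utf16CodeUnits: s.length,                      // what JavaScript (and Java's char) index by
    codePoints: [...s].length,                     // string iteration goes by code point
    utf8Bytes: new TextEncoder().encode(s).length, // what a byte-indexed UTF-8 string sees
  };
}

// "a" followed by the cowboy emoji U+1F920
console.log(lengths("a\u{1F920}"));
```

For this input the three counts are 3, 2 and 5, so "the character at index 1" means something different under each convention.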


chuckadams 39 days ago

> surrogates, regardless of whether they’re paired, are invalid in UTF-8

Java did not get the memo. Since the char type is fixed at 16 bits, it uses surrogates to encode everything outside the BMP, regardless of the encoding.


shawn_w 39 days ago

If you use the string methods that work with code points instead of chars, you rarely if ever have to deal with surrogate pairs in Java.


RedNifre 39 days ago

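It depends on the language and the libraries used. E.g. in Go, the problem mostly does not come up, because strings are byte sequences that conventionally hold UTF-8, and ranging over a string yields 32-bit runes (code points); Rust uses UTF-8, but it makes sure that you can't cut a string between bytes that belong to the same character.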

Fun Java/macOS quirk: macOS normalizes file names, so you can't have two files called ü in the same directory, one written as a single precomposed character and one as a base letter plus a combining character. Unfortunately, this normalization only happens on write, not on read. So if you type an ü on a German keyboard (which produces a single character) into a Java source file as part of a file name, the file is saved under the decomposed name, but opening it with the single-character name then fails because the file is not found.
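The two spellings of ü can be compared directly. The sketch below is in JavaScript and uses the standard `String.prototype.normalize`; the helper `sameName` is hypothetical, and nothing here models what macOS or Java actually do internally. It only shows why a byte-for-byte or unit-for-unit comparison of the two names fails.

```javascript
// Hypothetical helper: compare two names after normalizing both to NFC.
function sameName(a, b) {
  return a.normalize("NFC") === b.normalize("NFC");
}

const precomposed = "\u00FC"; // ü as a single code point
const decomposed = "u\u0308"; // u followed by COMBINING DIAERESIS

console.log(precomposed === decomposed);                  // false: different code units
console.log(sameName(precomposed, decomposed));           // true once both are normalized
console.log(precomposed.normalize("NFD") === decomposed); // true: NFD splits ü apart
```

Under this assumption, a lookup that normalizes the requested name the same way the stored name was normalized would find the file.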


georgemandis 39 days ago

The language handled it fine. It will generally just show replacement characters (�) for combos that don't map to anything.

It was really `encodeURIComponent` that didn't handle it gracefully.

If you just type this into the console (surrogate pair for cowboy smiley face emoji), you see it encodes it ("%F0%9F%A4%A0"):

encodeURIComponent("\uD83E\uDD20")

If you give it an invalid surrogate pair (here the two halves are swapped), it throws an actual error (a `URIError`):

encodeURIComponent("\uDD20\uD83E")
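One way to avoid the exception is to replace lone surrogates with U+FFFD before encoding. The following is a minimal sketch, not a recommendation: `safeEncodeURIComponent` and `LONE_SURROGATE` are made-up names, and the regular expression relies on lookbehind support in the engine. Newer engines also offer `String.prototype.toWellFormed()`, which does a similar replacement.

```javascript
// A high surrogate not followed by a low one, or a low surrogate not preceded by a high one.
const LONE_SURROGATE =
  /[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]/g;

// Hypothetical helper: swap lone surrogates for U+FFFD, then encode as usual.
function safeEncodeURIComponent(s) {
  return encodeURIComponent(s.replace(LONE_SURROGATE, "\uFFFD"));
}

console.log(safeEncodeURIComponent("\uD83E\uDD20")); // valid pair passes through unchanged

try {
  encodeURIComponent("\uDD20\uD83E");
} catch (e) {
  console.log(e instanceof URIError); // the built-in rejects the swapped pair
}

console.log(safeEncodeURIComponent("\uDD20\uD83E")); // both halves become U+FFFD
```

The trade-off is that the original (invalid) code units are lost, so this only suits cases where a lossy but non-throwing result is acceptable.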


chrismorgan 39 days ago

No, the language did not handle it fine. It allowed an invalid Unicode string to exist. This is basically a UTF-16 affliction—nothing that does UTF-16 validates, whereas almost everything that does UTF-8 does validate. encodeURIComponent deals with UTF-8, so of course it throws.


georgemandis 39 days ago

I'm realizing `encodeURIComponent` is actually part of the ECMA spec! I thought it was something provided by the browser like `window` or `document`. I withdraw my "the language handled it fine" comment, haha.

Before I'd looked that up I was going to say: "don't allow an invalid Unicode string to exist at all" feels like a separate/bigger problem to me from "handling it fine" when they do get created. To the extent I can hand JavaScript an invalid combination of code units in a variety of other scenarios, returning a � felt fine.

e.g.

// valid
String.fromCodePoint(0xd83e, 0xdd20)

// invalid, but "�" is ... fine?
String.fromCodePoint(0xdd20, 0xd83e)
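It may help to look at what the second string actually contains. In this small sketch (the helper `codeUnitsHex` is hypothetical), the � appears to be how the lone surrogates get displayed: the string itself still holds the original code units, and U+FFFD only shows up once something converts it to UTF-8, as `TextEncoder` does here.

```javascript
// Hypothetical helper: list the UTF-16 code units of a string as hex.
function codeUnitsHex(s) {
  return Array.from({ length: s.length }, (_, i) => s.charCodeAt(i).toString(16));
}

const reversed = String.fromCodePoint(0xdd20, 0xd83e);

console.log(codeUnitsHex(reversed));      // the two surrogates are still stored as-is
console.log(reversed === "\uFFFD\uFFFD"); // false: no replacement happened in the string

// Encoding to UTF-8 is where each lone surrogate is substituted with U+FFFD (EF BF BD).
console.log(Array.from(new TextEncoder().encode(reversed)));
```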


chrismorgan 38 days ago

In Rust, an invalid Unicode string simply cannot exist (* unless you use unsafe, but all bets are off then). An important part of this is that the code unit, the scalar value and the string are three different types (u8, char, str). Iteration must decide if it wants to go by code unit or by scalar value (… or by extended grapheme cluster, but that’s not provided in std).

JavaScript’s problems start with not having separate code unit or scalar value types. Sequences of UTF-16 code units, individual UTF-16 code units and scalar values all use the type string. (Code unit and scalar value also both use number in some contexts.)
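The type overlap is easy to see with `typeof`. In this sketch, `kinds` is a made-up helper that reports the runtime type of a whole string, of one code unit taken from it, of one scalar value taken from it, and of the numeric forms returned by `charCodeAt` and `codePointAt`.

```javascript
// Hypothetical helper: report the runtime types JavaScript uses for each concept.
function kinds(text) {
  return {
    wholeString: typeof text,               // a sequence of UTF-16 code units
    codeUnit: typeof text[0],               // one code unit, still a string
    scalarValue: typeof [...text][0],       // one code point, still a string
    codeUnitNumber: typeof text.charCodeAt(0),
    scalarNumber: typeof text.codePointAt(0),
  };
}

// For the cowboy emoji, text[0] is a lone high surrogate, yet it is typed like any other string.
console.log(kinds("\u{1F920}"));
```

The first three entries all come back as `string` and the last two as `number`, so the type system cannot tell a code unit from a scalar value.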

The first step to fixing JavaScript’s bad semantics would be separating the code unit and scalar value types. If you did that… the changes required to support strict strings are perhaps surprisingly small. Even migrating to UTF-8 semantics is not very hard then.

Unfortunately, JavaScript seems very determined to do stupid things and allow stupid things and then do more stupid things with the stupid things it foolishly allowed.
