| HN Mirror

by sdiacom 1229 days ago

Kind of, sort of, not really. What they imply (by using the term "ASCII" here) is not correct, and I'm not sure how the assurance that the string does not contain astral characters helps them split a string by the `.` character. But JavaScript doesn't exactly "smooth over this" in a very useful way, either.

For legacy reasons, JavaScript's "character unit", the basic component of a string, is a "UTF-16 character" (strictly speaking, a UTF-16 code unit), that is, sixteen bits that are interpreted as being UTF-16-encoded. That said, sixteen bits are not enough to represent all valid Unicode characters in the UTF-16 encoding. Instead, characters in the [supplemental planes] are represented in UTF-16 using two sixteen-bit "non-characters", which do not individually encode any Unicode character in any plane, but in combination reference a Unicode codepoint in one of the supplemental planes.
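To make the "two halves" arithmetic concrete, here is a rough sketch of how a supplemental-plane codepoint is split into its two sixteen-bit units. `toSurrogatePair` is a name I made up for illustration, not a built-in, and U+1F600 is just an arbitrary example codepoint.

```javascript
// Sketch: split a supplemental-plane codepoint into its two UTF-16 halves.
// `toSurrogatePair` is a hypothetical helper, not part of the language.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("not a supplemental-plane codepoint");
  }
  const offset = codePoint - 0x10000;      // leaves a 20-bit value
  const high = 0xD800 + (offset >> 10);    // top 10 bits go in the first half
  const low = 0xDC00 + (offset & 0x3FF);   // bottom 10 bits go in the second half
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1F600);
console.log(high.toString(16), low.toString(16));             // d83d de00
console.log(String.fromCharCode(high, low) === "\u{1F600}");  // true
```

The ranges 0xD800 to 0xDBFF and 0xDC00 to 0xDFFF are reserved for exactly this purpose, which is why neither half means anything on its own.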

JavaScript's internal representation of strings, as well as the APIs it exposes for dealing with strings, such as index accessing and string length, treat each of the sixteen-bit "halves" of the UTF-16 representation of a supplemental plane codepoint as individual characters.

This means that, when you index a string, you might get a UTF-16 character that represents a Unicode codepoint in the basic plane, or a UTF-16 "non-character" that, along with its other half, would represent a Unicode codepoint in one of the supplemental planes.
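A quick way to see this for yourself (the sample string is my own, with U+1F600 standing in for any supplemental-plane character):

```javascript
// "a", then U+1F600 (stored as two sixteen-bit units), then "b".
const sample = "a\u{1F600}b";

console.log(sample.length);                      // 4, not 3
console.log(sample[1] === "\uD83D");             // true: first half only
console.log(sample[2] === "\uDE00");             // true: second half only
console.log(sample.charCodeAt(1).toString(16));  // d83d
console.log(sample[3]);                          // b
```

Indexing at 1 or 2 hands back one half on its own, and everything after the character is shifted by one position compared to what you would expect from counting visible characters.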

[supplemental planes]: https://en.wikipedia.org/wiki/Plane_(Unicode) (see planes 1 to 16)

1 comments

mhagemeister 1229 days ago

Author here.

That's great feedback! After reading your comment and re-reading the section in the article it does indeed sound wrong. Decided to remove that paragraph. Your explanation of the string representation is really good. Thanks for sharing!

sdiacom 1227 days ago

I'm glad it helped! Now that I'm actually looking at the different ways to manipulate strings in JavaScript and not going from memory, the traditional JavaScript "except when it doesn't" caveat applies.

It seems like _some_ string operations treat each surrogate (that's the fancy name for the half-characters) as its own character, while others (correctly) treat the surrogate pair as a single character.
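For example, as far as I can tell the split looks roughly like this (not an exhaustive list, just the operations I tried, again using U+1F600 as the test character):

```javascript
const face = "\u{1F600}"; // one astral character, two sixteen-bit units

// These count each surrogate separately.
console.log(face.length);                       // 2
console.log(face.split("").length);             // 2
console.log(/^.$/.test(face));                  // false: "." matches one unit

// These treat the surrogate pair as a single character.
console.log([...face].length);                  // 1
console.log(Array.from(face).length);           // 1
console.log(face.codePointAt(0).toString(16));  // 1f600
console.log(/^.$/u.test(face));                 // true: the "u" flag matches by codepoint
```

Note that `codePointAt` still takes an index counted in sixteen-bit units, so even the "correct" functions can be mixed up with the others.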

This might explain how ensuring that the function name does not contain astral characters would make it easier to use different string functions together without accidentally introducing bugs.
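I don't know how the article's code actually performs that check, but one way to sketch such a guard is to compare the two ways of counting. `isBmpOnly` and the sample names are hypothetical, made up for this example.

```javascript
// Hypothetical helper: true when the string contains no surrogate pairs,
// so unit-based and codepoint-based functions agree on every index.
function isBmpOnly(str) {
  // Spreading walks the string by codepoint, while .length counts
  // sixteen-bit units; the counts differ only if a surrogate pair is present.
  return [...str].length === str.length;
}

console.log(isBmpOnly("render.call"));       // true
console.log(isBmpOnly("render\u{1F600}"));   // false
```

Once a string passes a check like this, an index obtained from one string function can be handed to any other without being off by one.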
