| HN Mirror


by njuw 775 days ago

They're UTF-16, and substr(), length, etc. work at the code unit level. Hence, the above isn't actually valid for all strings: any character with a code point between U+10000 and U+10FFFF requires 2 code units [1]. For example, U+10429 Deseret Small Letter Long E [2]:

  > '𐐩'.substr(0, 1)
  '\ud801'
  > '𐐩'.length
  2

[1] https://en.wikipedia.org/wiki/UTF-16#Description

[2] https://codepoints.net/U+10429
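
To make the '\ud801' result less mysterious, here is a rough sketch of how a code point above U+FFFF is split into two UTF-16 code units (a surrogate pair). The helper name toSurrogatePair is made up for illustration and is not a built-in; String.fromCharCode and String.fromCodePoint are standard.

```javascript
// Sketch: split a supplementary code point into its UTF-16 surrogate pair.
// toSurrogatePair is a hypothetical helper name, not part of the language.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError('expected a code point in U+10000..U+10FFFF');
  }
  const offset = codePoint - 0x10000;    // leaves a 20-bit value
  const high = 0xD800 + (offset >> 10);  // top 10 bits -> high surrogate
  const low = 0xDC00 + (offset & 0x3FF); // bottom 10 bits -> low surrogate
  return [high, low];
}

const [high, low] = toSurrogatePair(0x10429);
console.log(high.toString(16), low.toString(16)); // d801 dc29

// The two code units together form the same string as the code point itself.
console.log(String.fromCharCode(high, low) === String.fromCodePoint(0x10429)); // true
```

So substr(0, 1) on that string returns only the first half of the pair, which is the lone high surrogate shown above.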

1 comment

gbuk2013 775 days ago

TIL, thanks :) Interestingly, "for...of" iteration works on the whole character, so there must be some magic going on under the hood.
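
The "magic" is most likely the string iterator: for...of, spread, and Array.from all go through String.prototype[Symbol.iterator], which steps by code point rather than by code unit. A small sketch comparing the two views follows; substrByCodePoint is a hypothetical helper written for this example, not a standard method.

```javascript
// 'a', U+10429, 'b': three code points, four UTF-16 code units.
const s = 'a' + String.fromCodePoint(0x10429) + 'b';

console.log(s.length);      // 4 (code units)
console.log([...s].length); // 3 (code points)

// Index-based access walks code units, so the surrogate pair shows up split.
const units = [];
for (let i = 0; i < s.length; i++) {
  units.push(s.charCodeAt(i).toString(16));
}
console.log(units.join(' ')); // 61 d801 dc29 62

// for...of uses the string iterator, which keeps surrogate pairs together.
const points = [];
for (const ch of s) {
  points.push(ch.codePointAt(0).toString(16));
}
console.log(points.join(' ')); // 61 10429 62

// Hypothetical helper: take `count` code points starting at code point `start`.
function substrByCodePoint(str, start, count) {
  return Array.from(str).slice(start, start + count).join('');
}
console.log(substrByCodePoint(s, 1, 1) === String.fromCodePoint(0x10429)); // true
```

Note that this only fixes the code unit vs. code point mismatch. Characters built from several code points (combining marks, many emoji sequences) are still split by the iterator.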
