by gbuk2013 777 days ago

No, JS strings are UTF-8:

    > '蛋糕'.substr(0,1)
    '蛋'
    > '蛋糕'.length
    2
    > Buffer.byteLength('蛋糕')
    6

You do have to be careful when working with binary data (e.g. streams) but this is expected.

3 comments

njuw 777 days ago

They're UTF-16, and substr(), length, etc. work at the code unit level. Hence, the example above doesn't hold for all strings: any character with a code point between U+10000 and U+10FFFF requires two code units [1]. For example, U+10429 Deseret Small Letter Long E [2]:

  > '𐐩'.substr(0, 1)
  '\ud801'
  > '𐐩'.length
  2

[1] https://en.wikipedia.org/wiki/UTF-16#Description

[2] https://codepoints.net/U+10429
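
To make the "2 code units" point concrete, here is a minimal sketch of how a code point above U+FFFF maps to a UTF-16 surrogate pair. The helper name `toSurrogatePair` is hypothetical, not a built-in; the arithmetic follows the UTF-16 description linked in [1].

```javascript
// Hypothetical helper: encode a code point above U+FFFF as a UTF-16 surrogate pair.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10ffff) {
    throw new RangeError('expected a code point in U+10000..U+10FFFF');
  }
  const offset = codePoint - 0x10000;     // 20-bit value
  const high = 0xd800 + (offset >> 10);   // top 10 bits go into the high surrogate
  const low = 0xdc00 + (offset & 0x3ff);  // bottom 10 bits go into the low surrogate
  return [high, low];
}

const [high, low] = toSurrogatePair(0x10429);
console.log(high.toString(16), low.toString(16)); // d801 dc29

// The two code units together form the same string as the code point itself.
console.log(String.fromCharCode(high, low) === String.fromCodePoint(0x10429)); // true
```

This is why substr(0, 1) above returns '\ud801': it cuts the pair in half and keeps only the high surrogate.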

gbuk2013 776 days ago

TIL, thanks :) Interestingly, "for...of" iteration works on the whole character, so there must be some magic going on under the hood.
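
The "magic" is most likely the string iterator: `String.prototype[Symbol.iterator]` yields whole code points, while index-based access sees individual code units. A small sketch of the difference, with an arbitrary sample string chosen for illustration:

```javascript
const text = 'a\u{10429}b'; // 'a', U+10429 (two code units), 'b'

// Index-based access walks UTF-16 code units.
const units = [];
for (let i = 0; i < text.length; i++) {
  units.push(text.charCodeAt(i).toString(16));
}
console.log(units); // [ '61', 'd801', 'dc29', '62' ]

// for...of uses the string iterator, which yields code points.
const points = [];
for (const ch of text) {
  points.push(ch.codePointAt(0).toString(16));
}
console.log(points); // [ '61', '10429', '62' ]
```

Note that this iterates by code point, not by user-perceived character, so combining sequences and many emoji still come out as several pieces.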

trurl42 777 days ago

And with that you're completely wrong, since strings in JavaScript are UTF-16.

It just so happens that your example consists of two characters that each fit in a single UTF-16 code unit.

(Node.js' Buffer uses UTF-8 by default, which is why Buffer.byteLength returns 6.)
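
A quick way to see the Buffer default at work is to pass the encoding explicitly. This sketch assumes Node.js and only compares byte counts for the string from the original example:

```javascript
const cake = '蛋糕';

// The string itself is two UTF-16 code units long.
console.log(cake.length); // 2

// Buffer.byteLength defaults to 'utf8': each of these characters takes 3 bytes.
console.log(Buffer.byteLength(cake));         // 6
console.log(Buffer.byteLength(cake, 'utf8')); // 6

// Asking for UTF-16 gives 2 bytes per code unit instead.
console.log(Buffer.byteLength(cake, 'utf16le')); // 4
```

So the 6 in the original example measures the UTF-8 encoding that Buffer produces, not how the string is stored.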

kiitos 775 days ago

One ambiguity here might be that JavaScript defines strings as UTF-16, but JSON defines its text encoding as UTF-8.
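
The two views can coexist: in a sketch like the one below (assuming Node.js), JSON.stringify returns an ordinary JavaScript string, and UTF-8 only enters when that string is turned into bytes. JSON's own \uXXXX escapes, on the other hand, use UTF-16 surrogate pairs.

```javascript
const value = '\u{10429}'; // one code point, two UTF-16 code units

// JSON.stringify returns a JS string: quote, U+10429, quote.
const json = JSON.stringify(value);
console.log(json.length); // 4 code units

// Encoding that text as UTF-8 happens only when it becomes bytes.
const bytes = Buffer.from(json, 'utf8');
console.log(bytes.length); // 6 bytes: 1 + 4 + 1

// A JSON escape sequence spells the same character as a surrogate pair.
console.log(JSON.parse('"\\ud801\\udc29"') === value); // true

// Decoding the UTF-8 bytes and parsing round-trips the value.
console.log(JSON.parse(bytes.toString('utf8')) === value); // true
```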

DonHopkins 777 days ago

The 蛋糕 is a lie!