Hacker News new | ask | show | jobs
by gbuk2013 777 days ago
No, JS strings are UTF-8:

    > '蛋糕'.substr(0,1)
    '蛋'
    > '蛋糕'.length
    2
    > Buffer.byteLength('蛋糕')
    6
You do have to be careful when working with binary data (e.g. streams) but this is expected.
3 comments

They're UTF-16, and substr(), length, etc, work at the code unit level. Hence, the above isn't actually valid for all strings - any characters that are represented by codepoints between U+10000 and U+10FFFF require 2 code units [1]. For example U+10429 Deseret Small Letter Long E [2]

  > '𐐩'.substr(0, 1)
  '\ud801'
  > '𐐩'.length
  2
[1] https://en.wikipedia.org/wiki/UTF-16#Description

[2] https://codepoints.net/U+10429

TIL thanks :) Interestingly, "for of" iteration works on the whole character, so must be some magic going on under the hood.
And with that you're completely wrong, since strings in JavaScript are UTF-16.

It just so happens that your example consists of two UTF-16 codepoints.

(Node.js' Buffer uses UTF-8 by default).

One ambiguity here might be that Javascript defines strings as UTF-16, but JSON defines strings as UTF-8.
The 蛋糕 is a lie!