Hacker News new | ask | show | jobs
by SXX 778 days ago
While proper description is linked already in neighbouring comment.

TLDR: in 2024 with PHP8 you still need mbstring extension and also you should be careful around UTF-8 if you do any text processing. In almost all other modern programming languages it's just works.

2 comments

Anyone claiming such things doesn't understand Unicode at all.

The whole concept of having special Unicode-aware strlen or substr is nonsense.

In NodeJS for example don't you have use Buffers and special decoders to deal with UTF-8 strings? I.e it's a pain there too.
I don't think that's a pain. It's making explicit what should be explicit and the decoded string doesn't have an encoding attached (like in Ruby), it can't be in an unexpected format, it's always UTF-16. One can argue about weather UTF-16 is the best choice, but at least it's always that and always Unicode. No surprises.
No, JS strings are UTF-8:

    > '蛋糕'.substr(0,1)
    '蛋'
    > '蛋糕'.length
    2
    > Buffer.byteLength('蛋糕')
    6
You do have to be careful when working with binary data (e.g. streams) but this is expected.
They're UTF-16, and substr(), length, etc, work at the code unit level. Hence, the above isn't actually valid for all strings - any characters that are represented by codepoints between U+10000 and U+10FFFF require 2 code units [1]. For example U+10429 Deseret Small Letter Long E [2]

  > '𐐩'.substr(0, 1)
  '\ud801'
  > '𐐩'.length
  2
[1] https://en.wikipedia.org/wiki/UTF-16#Description

[2] https://codepoints.net/U+10429

TIL thanks :) Interestingly, "for of" iteration works on the whole character, so must be some magic going on under the hood.
And with that you're completely wrong, since strings in JavaScript are UTF-16.

It just so happens that your example consists of two UTF-16 codepoints.

(Node.js' Buffer uses UTF-8 by default).

One ambiguity here might be that Javascript defines strings as UTF-16, but JSON defines strings as UTF-8.
The 蛋糕 is a lie!