| HN Mirror



	by SXX 778 days ago
	A proper description is already linked in a neighbouring comment. TLDR: in 2024 with PHP 8 you still need the mbstring extension, and you should also be careful around UTF-8 if you do any text processing. In almost all other modern programming languages it just works.

2 comments

mgaunard 778 days ago

Anyone claiming such things doesn't understand Unicode at all.

The whole concept of having special Unicode-aware strlen or substr is nonsense.


dfgdfg34545456 778 days ago

In Node.js, for example, don't you have to use Buffers and special decoders to deal with UTF-8 strings? I.e. it's a pain there too.


panzi 778 days ago

I don't think that's a pain. It makes explicit what should be explicit, and the decoded string doesn't have an encoding attached (like in Ruby). It can't be in an unexpected format: it's always UTF-16. One can argue about whether UTF-16 is the best choice, but at least it's always that and always Unicode. No surprises.
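To illustrate the explicit decoding step, here is a minimal sketch. It assumes a runtime where `TextEncoder` and `TextDecoder` are globals (recent Node.js versions and browsers), and the chunk boundary is chosen purely for illustration. It shows why a stateful decoder matters when UTF-8 bytes arrive split in the middle of a character.

```javascript
// '蛋糕' is 6 bytes in UTF-8 (3 bytes per character).
const bytes = new TextEncoder().encode('蛋糕');

// Simulate a stream that splits the data in the middle of the second character.
const chunk1 = bytes.subarray(0, 4);
const chunk2 = bytes.subarray(4);

// Naive approach: decode each chunk on its own.
// The incomplete sequences turn into U+FFFD replacement characters.
const naive =
  new TextDecoder().decode(chunk1) + new TextDecoder().decode(chunk2);

// Streaming approach: one decoder keeps the incomplete bytes
// until the rest of the character arrives.
const decoder = new TextDecoder('utf-8');
const streamed =
  decoder.decode(chunk1, { stream: true }) + decoder.decode(chunk2);

console.log(naive === '蛋糕');    // false
console.log(streamed === '蛋糕'); // true
```

Once decoded, the result is an ordinary JavaScript string with no encoding attached, which is the point made above.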


gbuk2013 778 days ago

No, JS strings are UTF-8:

    > '蛋糕'.substr(0,1)
    '蛋'
    > '蛋糕'.length
    2
    > Buffer.byteLength('蛋糕')
    6

You do have to be careful when working with binary data (e.g. streams) but this is expected.


njuw 778 days ago

They're UTF-16, and substr(), length, etc, work at the code unit level. Hence, the above isn't actually valid for all strings - any characters that are represented by codepoints between U+10000 and U+10FFFF require 2 code units [1]. For example U+10429 Deseret Small Letter Long E [2]

    > '𐐩'.substr(0, 1)
    '\ud801'
    > '𐐩'.length
    2

[1] https://en.wikipedia.org/wiki/UTF-16#Description

[2] https://codepoints.net/U+10429
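A small sketch of the same point, for anyone who wants to poke at it. `sliceByCodePoint` is a hypothetical helper written for this example, not a built-in, and it only handles code points, not combined characters such as emoji sequences.

```javascript
// U+10429 lies above U+FFFF, so UTF-16 stores it as a surrogate pair.
const s = '𐐩';

// Code-unit view: this is what length, charCodeAt and substr operate on.
const units = [];
for (let i = 0; i < s.length; i++) {
  units.push(s.charCodeAt(i).toString(16));
}

// Code-point view: codePointAt reads the whole surrogate pair.
const codePoint = s.codePointAt(0).toString(16);

// Hypothetical helper: slice by code points instead of code units.
// Array.from uses the string iterator, which yields whole code points.
function sliceByCodePoint(str, start, end) {
  return Array.from(str).slice(start, end).join('');
}

console.log(units);                           // [ 'd801', 'dc29' ]
console.log(codePoint);                       // 10429
console.log(sliceByCodePoint('a𐐩b', 1, 2));  // 𐐩
```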


gbuk2013 778 days ago

TIL, thanks :) Interestingly, "for of" iteration works on the whole character, so there must be some magic going on under the hood.
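As far as I can tell the "magic" is that the string iterator used by `for...of` walks code points, while index access walks code units. A rough sketch comparing the two (the sample string is made up for illustration):

```javascript
// One ASCII letter, one character outside the BMP, one CJK character.
const text = 'a𐐩蛋';

// Index-based loop: visits UTF-16 code units, so the surrogate pair is split.
const viaIndex = [];
for (let i = 0; i < text.length; i++) {
  viaIndex.push(text[i]);
}

// for...of: uses String.prototype[Symbol.iterator], which yields code points.
const viaForOf = [];
for (const ch of text) {
  viaForOf.push(ch);
}

console.log(viaIndex.length); // 4
console.log(viaForOf.length); // 3
console.log(viaForOf);        // [ 'a', '𐐩', '蛋' ]
```

Note that this is still per code point, so a character built from several code points (a base letter plus a combining accent, say) would still be split.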


trurl42 778 days ago

And with that you're completely wrong, since strings in JavaScript are UTF-16.

It just so happens that each of the two characters in your example fits in a single UTF-16 code unit.

(Node.js' Buffer uses UTF-8 by default).
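A quick sketch of that difference, assuming Node.js (where `Buffer` is a global). It encodes the same string both ways and compares byte counts:

```javascript
const text = '蛋糕';

// Buffer.from(string) encodes as UTF-8 unless told otherwise.
const utf8 = Buffer.from(text);

// The same string encoded as UTF-16 (little endian).
const utf16 = Buffer.from(text, 'utf16le');

console.log(text.length);          // 2 (UTF-16 code units in the string)
console.log(utf8.length);          // 6 (3 bytes per character in UTF-8)
console.log(utf16.length);         // 4 (2 bytes per character in UTF-16)
console.log(utf8.toString('hex')); // e89b8be7b395

// toString() also defaults to UTF-8, so the round trip is lossless.
console.log(utf8.toString() === text); // true
```

So the 6 from `Buffer.byteLength` in the example further up is the UTF-8 size, while `.length` of 2 is the UTF-16 code unit count.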


kiitos 777 days ago

One ambiguity here might be that JavaScript defines strings as UTF-16, but JSON defines strings as UTF-8.
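A rough sketch of where the two meet, assuming a runtime with a global `TextEncoder`. Inside JavaScript, `JSON.stringify` and `JSON.parse` deal in ordinary (UTF-16) strings; UTF-8 only enters when the JSON text is turned into bytes. JSON `\u` escapes are also written as UTF-16 code units, so a character outside the BMP needs a surrogate pair.

```javascript
const value = '𐐩'; // U+10429

// JSON text as a JavaScript string: quote + 2 code units + quote.
const jsonText = JSON.stringify(value);

// The same JSON text as UTF-8 bytes: quote + 4 bytes + quote.
const jsonBytes = new TextEncoder().encode(jsonText);

// Escaped form: the surrogate pair d801 dc29 parses back to one character.
const fromEscapes = JSON.parse('"\\ud801\\udc29"');

console.log(jsonText.length);       // 4
console.log(jsonBytes.length);      // 6
console.log(fromEscapes === value); // true
```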


DonHopkins 778 days ago

The 蛋糕 is a lie!
