by hnarn 778 days ago
Can you elaborate on what exactly you mean by “natively support”?

I’d argue that it does natively support MultiByte strings, albeit with an extension library that’s part of the language.

Even your supplied link states they’re not supported at a “low level”, but states nothing about “native”.

If you want to be pedantic then multi-byte string !== UTF-8 support. ;)

Consider the intended purpose of the language and then consider whether the abstraction offered is appropriate. IMO in the case of PHP and UTF-8 it is not.

In my specific case it made my job harder than I would like on 2 projects I used PHP for, which is why I am complaining.

A proper description is already linked in a neighbouring comment.

TL;DR: in 2024, with PHP 8, you still need the mbstring extension, and you should be careful around UTF-8 if you do any text processing. In almost all other modern programming languages it just works.

Anyone claiming such things doesn't understand Unicode at all.

The whole concept of having special Unicode-aware strlen or substr is nonsense.

In NodeJS, for example, don't you have to use Buffers and special decoders to deal with UTF-8 strings? I.e. it's a pain there too.
I don't think that's a pain. It's making explicit what should be explicit, and the decoded string doesn't have an encoding attached (like in Ruby). It can't be in an unexpected format; it's always UTF-16. One can argue about whether UTF-16 is the best choice, but at least it's always that and always Unicode. No surprises.
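To make the "explicit" part concrete, here is a minimal sketch of decoding UTF-8 bytes in Node.js. It assumes a Node version where `Buffer` and `TextDecoder` are globals; the variable names are just for illustration.

```javascript
// Raw bytes, e.g. as they would arrive from a file or socket.
const bytes = Buffer.from('蛋糕', 'utf8');

// Decoding is an explicit step; the result is an ordinary JS string.
const text = bytes.toString('utf8');

// TextDecoder can also handle input split mid-character: with
// { stream: true } it holds back an incomplete sequence until the
// remaining bytes arrive.
const decoder = new TextDecoder('utf-8');
const streamed =
  decoder.decode(bytes.subarray(0, 4), { stream: true }) +
  decoder.decode(bytes.subarray(4));

console.log(bytes.length, text.length, text === streamed);
```

Once decoded, the string carries no encoding of its own; the byte count only matters again when you encode it back.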
No, JS strings are UTF-8:

    > '蛋糕'.substr(0,1)
    '蛋'
    > '蛋糕'.length
    2
    > Buffer.byteLength('蛋糕')
    6
You do have to be careful when working with binary data (e.g. streams) but this is expected.
They're UTF-16, and substr(), length, etc. work at the code unit level. Hence, the above isn't actually valid for all strings: any character whose code point lies between U+10000 and U+10FFFF requires 2 code units [1]. For example, U+10429 Deseret Small Letter Long E [2]:

    > '𐐩'.substr(0, 1)
    '\ud801'
    > '𐐩'.length
    2
[1] https://en.wikipedia.org/wiki/UTF-16#Description

[2] https://codepoints.net/U+10429

TIL thanks :) Interestingly, "for of" iteration works on the whole character, so must be some magic going on under the hood.
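The "magic" is the string iterator: `for...of`, spread and `Array.from` all go through `String.prototype[Symbol.iterator]`, which steps by code point rather than by code unit. A small sketch, with variable names made up for the example:

```javascript
// U+10429, written as an escape so the source stays ASCII-safe.
const s = '\u{10429}';

// length and index-based methods count UTF-16 code units.
const codeUnits = s.length;

// The string iterator yields whole code points, so a surrogate
// pair comes out as a single element.
const chars = [];
for (const ch of s) {
  chars.push(ch);
}
const codePoints = [...s].length;

// codePointAt reads the full code point starting at an index.
const hex = s.codePointAt(0).toString(16);

console.log(codeUnits, codePoints, hex);
```

Note that this is still code points, not user-perceived characters: combining sequences and many emoji span several code points.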
And with that you're completely wrong, since strings in JavaScript are UTF-16.

It just so happens that each of the two characters in your example fits in a single UTF-16 code unit.

(Node.js' Buffer uses UTF-8 by default).

One ambiguity here might be that JavaScript defines strings as UTF-16, but JSON defines strings as UTF-8.
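A rough illustration of where the two meet, assuming Node.js: the in-memory string is counted in UTF-16 code units, the serialized JSON text is counted in UTF-8 bytes once encoded, and `\u` escapes inside JSON text use UTF-16 surrogate pairs.

```javascript
const s = '\u{10429}';

// In memory: two UTF-16 code units.
const units = s.length;

// Serialized: the character itself between two quote characters.
const json = JSON.stringify(s);

// On the wire as UTF-8: 4 bytes for the character plus 2 for the quotes.
const utf8Bytes = Buffer.byteLength(json, 'utf8');

// JSON escapes for the same character are a surrogate pair.
const parsed = JSON.parse('"\\ud801\\udc29"');

console.log(units, utf8Bytes, parsed === s);
```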
The 蛋糕 is a lie!