| HN Mirror



hnarn 778 days ago

Can you elaborate on what exactly you mean by “natively support”?

2 comments

gbuk2013 778 days ago

https://phptherightway.com/#php_and_utf8


MissTake 778 days ago

I’d argue that it does natively support MultiByte strings, albeit with an extension library that’s part of the language.

Even your supplied link states they’re not supported at a “low level”, but states nothing about “native”.


gbuk2013 778 days ago

If you want to be pedantic then multi-byte string !== UTF-8 support. ;)

Consider the intended purpose of the language and then consider whether the abstraction offered is appropriate. IMO in the case of PHP and UTF-8 it is not.

In my specific case it made my job harder than I would like on 2 projects I used PHP for, which is why I am complaining.


SXX 778 days ago

A proper description is already linked in a neighbouring comment.

TL;DR: in 2024 with PHP 8 you still need the mbstring extension, and you should be careful around UTF-8 if you do any text processing. In almost all other modern programming languages it just works.


mgaunard 778 days ago

Anyone claiming such things doesn't understand Unicode at all.

The whole concept of having special Unicode-aware strlen or substr is nonsense.


dfgdfg34545456 778 days ago

In Node.js, for example, don't you have to use Buffers and special decoders to deal with UTF-8 strings? I.e., it's a pain there too.


panzi 778 days ago

I don't think that's a pain. It makes explicit what should be explicit, and the decoded string doesn't have an encoding attached (as in Ruby). It can't be in an unexpected format; it's always UTF-16. One can argue about whether UTF-16 is the best choice, but at least it's always that and always Unicode. No surprises.
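To illustrate the "explicit decoding" point, here is a minimal Node.js sketch. It assumes the incoming bytes are UTF-8 and arrive in two chunks that happen to split a character in half; the chunk boundary and variable names are made up for the example. It relies only on the global `Buffer` and `TextDecoder`.

```javascript
// UTF-8 bytes for a two-character string (3 bytes per character here).
const bytes = Buffer.from('蛋糕', 'utf8');
console.log(bytes.length); // 6

// Decoding the whole buffer at once is straightforward.
const whole = bytes.toString('utf8');

// Hypothetical chunking: the first chunk ends in the middle of '蛋'.
const chunks = [bytes.subarray(0, 2), bytes.subarray(2)];

// Naive per-chunk decoding cannot reassemble the split character
// and produces U+FFFD replacement characters instead.
const naive = chunks.map((chunk) => chunk.toString('utf8')).join('');

// A streaming decoder buffers the incomplete bytes between calls.
const decoder = new TextDecoder('utf-8');
let streamed = '';
for (const chunk of chunks) {
  streamed += decoder.decode(chunk, { stream: true });
}
streamed += decoder.decode(); // flush any remaining buffered bytes

console.log(whole === streamed); // true
console.log(naive === whole);    // false
```

Once decoded, the result is an ordinary JavaScript string, so nothing downstream needs to know which encoding the bytes used.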


gbuk2013 778 days ago

No, JS strings are UTF-8:

    > '蛋糕'.substr(0,1)
    '蛋'
    > '蛋糕'.length
    2
    > Buffer.byteLength('蛋糕')
    6

You do have to be careful when working with binary data (e.g. streams) but this is expected.


njuw 778 days ago

They're UTF-16, and substr(), length, etc. work at the code unit level. Hence, the above isn't actually valid for all strings: any character whose code point lies between U+10000 and U+10FFFF requires 2 code units [1]. For example, U+10429 Deseret Small Letter Long E [2]:

    > '𐐩'.substr(0, 1)
    '\ud801'
    > '𐐩'.length
    2

[1] https://en.wikipedia.org/wiki/UTF-16#Description

[2] https://codepoints.net/U+10429
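The two code units in that example follow from the UTF-16 surrogate-pair arithmetic described in [1]. As a rough sketch (the helper name `toSurrogatePair` is made up for illustration and is not a built-in), it can be reproduced by hand and compared with what the string actually stores:

```javascript
// Split a supplementary code point (U+10000..U+10FFFF) into its
// UTF-16 high and low surrogate code units.
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;       // 20-bit value
  const high = 0xd800 + (offset >> 10);     // top 10 bits
  const low = 0xdc00 + (offset & 0x3ff);    // bottom 10 bits
  return [high, low];
}

const s = '𐐩'; // U+10429
const pair = toSurrogatePair(0x10429);

console.log(pair.map((u) => u.toString(16)));  // [ 'd801', 'dc29' ]
console.log(s.charCodeAt(0).toString(16));     // 'd801' (code unit)
console.log(s.charCodeAt(1).toString(16));     // 'dc29' (code unit)
console.log(s.codePointAt(0).toString(16));    // '10429' (code point)
```

`charCodeAt` and `length` see the individual code units, while `codePointAt` combines a valid surrogate pair back into one code point.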


gbuk2013 778 days ago

TIL, thanks :) Interestingly, "for of" iteration works on the whole character, so there must be some magic going on under the hood.
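The "magic" is the string iterator: `for...of`, spread, and `Array.from` all go through `String.prototype[Symbol.iterator]`, which steps by code point rather than by code unit. A small sketch of the difference (the sample string and the `firstCodePoints` helper are made up for illustration):

```javascript
const text = 'a𐐩蛋';

// Index-based access walks UTF-16 code units.
const units = [];
for (let i = 0; i < text.length; i++) {
  units.push(text[i]);
}

// for...of uses the string iterator and walks code points.
const points = [];
for (const ch of text) {
  points.push(ch);
}

console.log(units.length);  // 4 (the surrogate pair counts twice)
console.log(points.length); // 3

// Hypothetical helper: take the first n code points without
// cutting a surrogate pair in half.
function firstCodePoints(str, n) {
  return Array.from(str).slice(0, n).join('');
}

console.log(firstCodePoints('𐐩x', 1) === '𐐩'); // true
```

Code points are still not the same thing as user-perceived characters: a letter followed by a combining accent, for instance, is two code points.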


trurl42 778 days ago

And with that you're completely wrong, since strings in JavaScript are UTF-16.

It just so happens that each of the two characters in your example fits in a single UTF-16 code unit.

(Node.js' Buffer uses UTF-8 by default).
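The mismatch between the string's length and `Buffer.byteLength` in the earlier example comes from that default. A quick sketch comparing the two encodings side by side (the `byteLengths` helper is made up for illustration):

```javascript
// Report how many bytes a string takes in each encoding, next to
// the number of UTF-16 code units that .length counts.
function byteLengths(str) {
  return {
    codeUnits: str.length,
    utf8: Buffer.from(str).length,            // default encoding is 'utf8'
    utf16le: Buffer.from(str, 'utf16le').length,
  };
}

console.log(byteLengths('蛋糕')); // { codeUnits: 2, utf8: 6, utf16le: 4 }
console.log(byteLengths('𐐩'));    // { codeUnits: 2, utf8: 4, utf16le: 4 }
```

So `'蛋糕'.length === 2` says nothing about UTF-8; it only shows that each of those characters occupies one UTF-16 code unit.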


kiitos 777 days ago

One ambiguity here might be that JavaScript defines strings as UTF-16, but JSON defines strings as UTF-8.
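In practice the two rarely collide, because the encoding only matters at the byte boundary: `JSON.stringify` and `JSON.parse` work on JavaScript strings, and UTF-8 appears when the text is turned into bytes for exchange. A hedged sketch, where the object and variable names are invented for the example:

```javascript
const value = { cake: '蛋糕' };

// In memory: a JavaScript string made of UTF-16 code units.
const text = JSON.stringify(value);

// On the wire: the same text encoded as UTF-8 bytes.
const wire = Buffer.from(text, 'utf8');

// Receiving side: decode the bytes, then parse.
const roundTripped = JSON.parse(wire.toString('utf8'));

console.log(wire.length - text.length);     // 4 (each CJK character adds 2 extra bytes)
console.log(roundTripped.cake === '蛋糕');   // true

// JSON's \u escapes are themselves written as UTF-16 code units, so a
// supplementary character is escaped as a surrogate pair.
const escaped = JSON.parse('"\\ud801\\udc29"');
console.log(escaped === '𐐩'); // true
```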


DonHopkins 778 days ago

The 蛋糕 is a lie!
