by driverdan 4948 days ago
If you search for V8 UCS-2 you'll find a lot of discussion on this issue dating back at least a few years. There are ways to work around V8's lack of support for surrogate pairs. See this V8 issue for ideas: https://code.google.com/p/v8/issues/detail?id=761

My question is why does V8 (or anything else) still use UCS-2?

3 comments

The ES5 spec defines a string as being a series of UTF-16 code-units, which inherently means surrogates show through.
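To make "surrogates show through" concrete, here is a minimal sketch in ES5-level JavaScript. The helper names (isHighSurrogate, isLowSurrogate, countCodePoints) are made up for illustration and are not part of any spec or of V8.

```javascript
// U+1F600 lies outside the Basic Multilingual Plane, so a JavaScript
// string stores it as two UTF-16 code units (a surrogate pair).
var s = "\uD83D\uDE00";

var unitCount = s.length;    // 2: length counts code units, not characters
var high = s.charCodeAt(0);  // 0xD83D, the high (lead) surrogate
var low = s.charCodeAt(1);   // 0xDE00, the low (trail) surrogate

// Hypothetical helpers, named here only for this example.
function isHighSurrogate(unit) {
  return unit >= 0xD800 && unit <= 0xDBFF;
}

function isLowSurrogate(unit) {
  return unit >= 0xDC00 && unit <= 0xDFFF;
}

// Count code points by treating each well-formed surrogate pair as one.
// A lone (unpaired) surrogate is counted as one code point on its own.
function countCodePoints(str) {
  var count = 0;
  for (var i = 0; i < str.length; i++) {
    count++;
    if (isHighSurrogate(str.charCodeAt(i)) &&
        i + 1 < str.length &&
        isLowSurrogate(str.charCodeAt(i + 1))) {
      i++; // skip the low half of the pair
    }
  }
  return count;
}
```

With these definitions, "a" + s + "b" has a length of 4 but countCodePoints reports 3.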

APIs like that tend to be low priority because they aren't used by browsers (which pass everything through as UTF-16 code-units, typically treating them as possibly-valid UTF-16 strings).

> My question is why does V8 (or anything else) still use UCS-2?

Because the ES spec defines a string as a sequence of UTF-16 code units (aka UCS-2 with visible surrogates). As in many other languages (e.g. Java), the string type was created during, or inherited from, the Unicode 1.0 era, when every character fit in 16 bits. UTF-16 is a retrofit of that fixed-width 16-bit encoding: it adds surrogate pairs to accommodate the full range of later Unicode versions.
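For reference, the surrogate-pair retrofit is plain arithmetic on the code point. A rough sketch follows; the function names toSurrogatePair and fromSurrogatePair are invented for this example, and fromSurrogatePair assumes it is handed a valid high/low pair.

```javascript
// Split a supplementary code point (U+10000..U+10FFFF) into a UTF-16
// surrogate pair. Hypothetical helper, for illustration only.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("not a supplementary code point");
  }
  var offset = codePoint - 0x10000;   // a 20-bit value
  var high = 0xD800 + (offset >> 10); // top 10 bits
  var low = 0xDC00 + (offset & 0x3FF); // bottom 10 bits
  return [high, low];
}

// Recombine a pair into the original code point.
// Assumes high is in 0xD800..0xDBFF and low is in 0xDC00..0xDFFF.
function fromSurrogatePair(high, low) {
  return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
}

var pair = toSurrogatePair(0x1F600); // [0xD83D, 0xDE00]
```

Code points below U+10000 need no pair at all, which is why code written against Unicode 1.0-era assumptions keeps working until a supplementary character shows up.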

Because counting fixed 2-byte units is much faster for computers than counting variable-length sequences of 1, 2, 3, or even 4 bytes.
This is not a real issue, because counting code points in a UTF-8 string is easy too: the encoding is cleverly defined so that you only need to count the bytes that are not continuation bytes, i.e. those whose top two bits are not 10. Since UTF-8 strings are generally shorter, it can even be faster than counting UTF-16 if you don't know the length in advance.
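As a sketch of how cheap that count is: every UTF-8 code point starts with exactly one byte that is not a continuation byte (continuation bytes have the bit pattern 10xxxxxx), so counting the non-continuation bytes gives the number of code points. The function name countUtf8CodePoints is made up for this example, and the input is assumed to be an array of byte values holding well-formed UTF-8.

```javascript
// Count code points in a UTF-8 byte sequence.
// Assumes `bytes` is an array (or typed array) of well-formed UTF-8.
function countUtf8CodePoints(bytes) {
  var count = 0;
  for (var i = 0; i < bytes.length; i++) {
    // 0xC0 masks the top two bits; 0x80 is the continuation pattern 10xxxxxx.
    if ((bytes[i] & 0xC0) !== 0x80) {
      count++;
    }
  }
  return count;
}

// "a" (1 byte), U+20AC (3 bytes), U+1F600 (4 bytes): 8 bytes in total.
var sample = [0x61, 0xE2, 0x82, 0xAC, 0xF0, 0x9F, 0x98, 0x80];
var codePoints = countUtf8CodePoints(sample); // 3
```

The loop does no validation, so malformed input (for example a stray continuation byte) would simply be miscounted.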