| HN Mirror



	by driverdan 4995 days ago
	If you search for V8 UCS-2 you'll find a lot of discussion on this issue dating back at least a few years. There are ways to work around V8's lack of support for surrogate pairs. See this V8 issue for ideas: https://code.google.com/p/v8/issues/detail?id=761 My question is why does V8 (or anything else) still use UCS-2?

3 comments

gsnedders 4995 days ago

The ES5 spec defines a string as being a series of UTF-16 code-units, which inherently means surrogates show through.
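A small sketch of what "surrogates show through" looks like in practice. It assumes an engine with the ES2015 additions (`codePointAt`, string iteration), which are newer than the ES5 spec discussed here; the sample string is only an illustration.

```javascript
// "a" followed by U+1F4A9, written out as an explicit surrogate pair.
const s = "a\uD83D\uDCA9";

// length counts UTF-16 code units, not code points.
const codeUnits = s.length;

// charCodeAt exposes the two surrogates individually.
const high = s.charCodeAt(1).toString(16);
const low = s.charCodeAt(2).toString(16);

// String iteration (ES2015) walks code points instead of code units.
const codePoints = Array.from(s).length;
const scalar = s.codePointAt(1).toString(16);

console.log(codeUnits, high, low, codePoints, scalar);
// prints: 3 d83d dca9 2 1f4a9
```

The string holds two characters as far as a reader is concerned, but any API defined in terms of code units (`length`, `charCodeAt`, indexing) reports three.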

APIs like that tend to be low priority because they aren't used by browsers (which pass everything through as UTF-16 code-units, typically treating them as possibly-valid UTF-16 strings).


masklinn 4994 days ago

> My question is why does V8 (or anything else) still use UCS-2?

Because the ES spec defines a string as a sequence of UTF-16 code units (aka UCS-2-with-visible-surrogates), and because, like many other languages (e.g. Java), its strings were created during/inherited from Unicode 1.0, which fit in 16 bits. (UTF-16 is a retrofit of Unicode 1.0's fixed-width encoding to accommodate the full range of later Unicode versions by adding surrogate pairs.)
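For reference, the surrogate-pair retrofit is plain arithmetic on the code point. The sketch below follows the standard UTF-16 encoding rule; the helper names `toSurrogatePair` and `fromSurrogatePair` are made up for illustration and are not part of any spec or of V8.

```javascript
// Hypothetical helper: split a code point above U+FFFF into a UTF-16 surrogate pair.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("only U+10000..U+10FFFF is encoded as a surrogate pair");
  }
  const offset = codePoint - 0x10000;     // leaves a 20-bit value
  const high = 0xD800 + (offset >> 10);   // top 10 bits go into the high surrogate
  const low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits go into the low surrogate
  return [high, low];
}

// Hypothetical inverse: recombine a surrogate pair into the original code point.
function fromSurrogatePair(high, low) {
  return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
}

const pair = toSurrogatePair(0x1F4A9);
console.log(pair.map((u) => u.toString(16))); // [ 'd83d', 'dca9' ]
console.log(fromSurrogatePair(pair[0], pair[1]).toString(16)); // 1f4a9
```

Code points up to U+FFFF are stored as a single 16-bit unit, which is why the original 16-bit design kept working unchanged for the older range.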


est 4995 days ago

Because counting fixed 2-byte units is much faster for computers than counting characters that vary between 1, 2, 3 or even 4 bytes.


speleding 4994 days ago

This is not a real issue because counting code points in a UTF-8 string is easy too: the encoding is cleverly defined such that you just need to count the bytes that are not continuation bytes, i.e. those whose top two bits are not 10. Since UTF-8 strings are generally shorter, it can even be faster than counting UTF-16 if you don't know the length in advance.
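A minimal sketch of that byte-counting trick, assuming well-formed UTF-8 input and a runtime that provides the global `TextEncoder` (browsers and recent Node.js). In UTF-8, every byte of the form `10xxxxxx` continues a previous code point, so every other byte starts a new one. The function name `countUtf8CodePoints` is made up for this example.

```javascript
// Hypothetical helper: count code points in a buffer of well-formed UTF-8 bytes.
function countUtf8CodePoints(bytes) {
  let count = 0;
  for (const b of bytes) {
    // Continuation bytes look like 10xxxxxx; anything else begins a code point.
    if ((b & 0xC0) !== 0x80) {
      count++;
    }
  }
  return count;
}

// "a" (1 byte), U+20AC (3 bytes), U+1F4A9 (4 bytes).
const bytes = new TextEncoder().encode("a\u20AC\u{1F4A9}");
console.log(bytes.length, countUtf8CodePoints(bytes)); // 8 3
```

The loop is a single pass with one mask and compare per byte, and it never has to decode the characters themselves.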
