by cmccabe 4995 days ago

UCS2 allows random access to characters, UTF-16 does not.

I'm not sure if that's really true. On IBM's site, they define 3 levels of UCS-2, only one of which excludes "combining characters" (really code points).

http://pic.dhe.ibm.com/infocenter/aix/v6r1/index.jsp?topic=%...

If you have combining characters, then you can't simply take the number of bytes and divide by 2 to get the number of letters. If you don't have combining characters, then you have something which isn't terribly useful except for European languages (I think?)
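To make the "divide the bytes by 2" point concrete, here is a rough Python sketch. The helper names `utf16_code_units` and `base_letters` are made up for illustration, and counting non-combining code points is only an approximation of "letters" (real grapheme segmentation has more rules):

```python
import unicodedata


def utf16_code_units(text):
    """Number of 16-bit code units: encoded byte length divided by 2."""
    return len(text.encode("utf-16-le")) // 2


def base_letters(text):
    """Approximate letter count: skip code points with a nonzero combining class."""
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


decomposed = "e\u0301"  # 'e' followed by U+0301 COMBINING ACUTE ACCENT

print(utf16_code_units(decomposed))  # 2 code units (4 bytes / 2)
print(base_letters(decomposed))      # 1 letter as a reader would see it
```

Both code points here are in the BMP, so the mismatch comes purely from the combining mark, not from surrogate pairs.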

Maybe someone more familiar with the implementation can describe which path they actually went down for this... given what I've heard so far, I'm not optimistic.

pbiggar 4995 days ago

OK, I cracked into the V8 source to take a look at what actually happens. It looks like the implementation does use random access for two-byte strings. However, it also uses multiple string implementations: ASCII, two-byte strings, "consString" (I presume some kind of rope), and "Sliced Strings" (sounds like a rope again, but might be shared storage of the string contents for immutable strings). So they could likely use other implementations with whatever properties they choose.

See https://github.com/v8/v8/blob/3ff861bbbb62a6c0078e042d8077b2... and https://github.com/v8/v8/blob/3ff861bbbb62a6c0078e042d8077b2....
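For anyone unfamiliar with ropes, this is roughly what I mean by the guess above. It is a minimal Python sketch of the general technique, not V8's code: the class names `Flat` and `Cons` and the method `char_at` are hypothetical. Concatenation just builds a node, and indexing walks down the tree instead of doing a single array lookup:

```python
class Flat:
    """Leaf node: characters stored contiguously, so indexing is direct."""

    def __init__(self, text):
        self.text = text

    def __len__(self):
        return len(self.text)

    def char_at(self, index):
        return self.text[index]


class Cons:
    """Concatenation node: holds two children and copies nothing."""

    def __init__(self, left, right):
        self.left = left
        self.right = right
        self.length = len(left) + len(right)

    def __len__(self):
        return self.length

    def char_at(self, index):
        # Descend until a flat leaf is reached, adjusting the index
        # each time we step into a right-hand child.
        node = self
        while isinstance(node, Cons):
            left_length = len(node.left)
            if index < left_length:
                node = node.left
            else:
                index -= left_length
                node = node.right
        return node.char_at(index)


s = Cons(Cons(Flat("ab"), Flat("cd")), Flat("ef"))
print("".join(s.char_at(i) for i in range(len(s))))  # abcdef
```

The point is only that indexing cost depends on the representation, so a single engine can mix flat strings with other layouts.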

cmccabe 4995 days ago

Ugh, I just realized this whole article is a farce. V8 added UTF-16 support earlier this year. Its support for non-BMP code points is now on par with Java's (although it has many of the same limitations).
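As a sketch of the kind of limitation I mean (in Python rather than V8 or Java code, with made-up helper names `utf16_units` and `code_point_at`): indexing is by 16-bit code unit, so a non-BMP code point occupies two indices, and landing on the second one gives you a lone surrogate.

```python
def utf16_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def code_point_at(units, index):
    """Decode the code point starting at a code-unit index.

    If the index points at a high surrogate followed by a low surrogate,
    the pair is combined; otherwise the raw unit is returned as-is.
    """
    unit = units[index]
    if 0xD800 <= unit <= 0xDBFF and index + 1 < len(units):
        low = units[index + 1]
        if 0xDC00 <= low <= 0xDFFF:
            return 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)
    return unit


units = utf16_units("a\U0001F600")      # 'a' followed by U+1F600
print([hex(u) for u in units])          # ['0x61', '0xd83d', '0xde00']
print(hex(code_point_at(units, 1)))     # 0x1f600
print(hex(code_point_at(units, 2)))     # 0xde00, a lone low surrogate
```

So random access by code unit still works in O(1); what you lose is the guarantee that index N is the Nth character.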