by pbiggar 4959 days ago
A couple of reasons why it makes sense for V8 and other vendors to use UCS2:

- The spec says UCS2 or UTF16. Those are the only options.

- UCS2 allows random access to characters, UTF-16 does not.

- Remember how the JS engines were fighting for speed on arbitrary benchmarks, and nobody cared about anything else for 5 years? UCS2 helps string benchmarks be fast!

- Changing from UCS2 to UTF-16 might "break the web", something browser vendors hate (and so do web developers).

- Java was UCS2. Then Java 5 changed to UTF-16. Why didn't JS change to UTF-16? Because a Java VM only has to run one program at once! In JS, you can't specify a version or an encoding, and one engine has to run everything on the web. No migration path to other encodings!
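
To make the random-access point concrete, here is a minimal sketch in JavaScript. Indexing by 16-bit code unit is a direct lookup, while finding the nth code point in UTF-16 means scanning for surrogate pairs. The helper `codePointAtIndex` is a made-up name for illustration, not an engine API.

```javascript
// Hypothetical helper: return the n-th code point (not code unit) of a string.
// It has to walk from the start, because a surrogate pair takes two code units.
function codePointAtIndex(str, n) {
  let unit = 0; // position in 16-bit code units
  for (let seen = 0; unit < str.length; seen++) {
    const hi = str.charCodeAt(unit);
    const isPair =
      hi >= 0xd800 && hi <= 0xdbff && unit + 1 < str.length &&
      str.charCodeAt(unit + 1) >= 0xdc00 && str.charCodeAt(unit + 1) <= 0xdfff;
    if (seen === n) {
      if (!isPair) return hi;
      const lo = str.charCodeAt(unit + 1);
      return (hi - 0xd800) * 0x400 + (lo - 0xdc00) + 0x10000;
    }
    unit += isPair ? 2 : 1;
  }
  return undefined; // n is past the end
}

const s = "a\uD834\uDD1Eb"; // "a", U+1D11E (musical G clef), "b"
const lengthInUnits = s.length;          // counts code units, not code points
const unitAt2 = s.charCodeAt(2);         // direct index, but lands on a low surrogate
const pointAt2 = codePointAtIndex(s, 2); // linear scan, finds "b"
```

With UCS2 semantics, index 2 is simply the third 16-bit value. With code point semantics, the third character is "b", and you only find that out by walking the string.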

1 comments

> UCS2 allows random access to characters, UTF-16 does not.

I'm not sure if that's really true. On IBM's site, they define 3 levels of UCS-2, only one of which excludes "combining characters" (really code points).

http://pic.dhe.ibm.com/infocenter/aix/v6r1/index.jsp?topic=%...

If you have combining characters, then you can't simply take the number of bytes and divide by 2 to get the number of letters. If you don't have combining characters, then you have something which isn't terribly useful except for European languages (I think?)
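
A small JavaScript sketch of the combining-character problem, assuming an engine that supports `String.prototype.normalize` (ES2015 and later). The variable names are mine, for illustration only.

```javascript
const precomposed = "\u00E9"; // "é" as a single code point
const combining = "e\u0301";  // "e" followed by U+0301 COMBINING ACUTE ACCENT

// Both render as one letter, but the code unit counts differ.
const precomposedUnits = precomposed.length;
const combiningUnits = combining.length;

// The two strings are different sequences until they are normalized.
const equalRaw = precomposed === combining;
const equalAfterNfc = combining.normalize("NFC") === precomposed;
```

So even without any surrogate pairs, bytes divided by 2 gives you code units, not letters. Normalization helps for this pair, but not every combining sequence has a precomposed form.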

Maybe someone more familiar with the implementation can describe which path they actually went down for this... given what I've heard so far, I'm not optimistic.

OK, I cracked into the V8 source to take a look at what actually happens. It looks like the implementation does use random access for two-byte strings. However, it also uses multiple string implementations: ASCII, two-byte strings, "consString" (I presume some kind of rope), and "Sliced Strings" (sounds like a rope again, but might be shared storage of the string contents for immutable strings). So they could likely use other implementations with whatever properties they choose.

See https://github.com/v8/v8/blob/3ff861bbbb62a6c0078e042d8077b2... and https://github.com/v8/v8/blob/3ff861bbbb62a6c0078e042d8077b2....
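
For anyone unfamiliar with ropes, here is a toy sketch of the idea those names suggest. This is not V8's code: `FlatStr`, `ConsStr`, and `SlicedStr` are hypothetical classes, and I'm only guessing that V8's variants work along these lines.

```javascript
// Toy string representations for illustration; NOT V8's actual classes.
class FlatStr {
  constructor(text) {
    this.text = text;
    this.length = text.length;
  }
  charCodeAt(i) {
    return this.text.charCodeAt(i); // direct index into flat storage
  }
}

// Concatenation node: keeps references to both halves instead of copying.
class ConsStr {
  constructor(left, right) {
    this.left = left;
    this.right = right;
    this.length = left.length + right.length;
  }
  charCodeAt(i) {
    return i < this.left.length
      ? this.left.charCodeAt(i)
      : this.right.charCodeAt(i - this.left.length);
  }
}

// Substring view: shares the parent's storage through an offset.
class SlicedStr {
  constructor(parent, start, length) {
    this.parent = parent;
    this.start = start;
    this.length = length;
  }
  charCodeAt(i) {
    return this.parent.charCodeAt(this.start + i);
  }
}

const joined = new ConsStr(new FlatStr("hello, "), new FlatStr("world"));
const slice = new SlicedStr(joined, 7, 5); // a view over "world"
const firstOfSlice = String.fromCharCode(slice.charCodeAt(0));
```

All three expose the same indexed lookup, which is why swapping representations behind the scenes is plausible. In this toy version the cost of a lookup depends on how deep the tree is.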

Ugh, I just realized this whole article is a farce. V8 added UTF-16 support earlier this year. Its support for non-BMP code points is now on par with Java (although it has many of the same limitations).