Hacker News
by ender7 4948 days ago
Apropos: http://mathiasbynens.be/notes/javascript-encoding

TL;DR:

- JavaScript engines are free to internally represent strings as either UCS-2 or UTF-16. Engines that choose to go UCS-2 tend to replace all glyphs outside of the BMP with the replacement char (U+FFFD). Firefox, IE, Opera, and Safari all do this (with some inconsistencies).

- However, from the point of view of the actual JS code that gets executed, strings are always UCS-2 (sort of). In UTF-16, code points outside the BMP are encoded as surrogate pairs (4 bytes). But -- if you have a JavaScript string that contains such a character, it will be treated as two consecutive 2-byte characters.

  var x = '𝌆';
  x.length; // 2
  x[0];     // \uD834
  x[1];     // \uDF06

Note that if you insert said string into the DOM, it will still render correctly (you'll see a single character instead of two ?s).
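
For anyone who wants the whole code point rather than the two halves, here is a minimal sketch that recombines them with the standard UTF-16 surrogate arithmetic. The helper name `codePointAtIndex` is made up for illustration and is not a built-in.

```javascript
// Hypothetical helper (not a built-in): returns the code point that starts
// at index i, combining a high/low surrogate pair when one is present.
function codePointAtIndex(str, i) {
  var hi = str.charCodeAt(i);
  // High surrogates occupy 0xD800..0xDBFF.
  if (hi >= 0xD800 && hi <= 0xDBFF && i + 1 < str.length) {
    var lo = str.charCodeAt(i + 1);
    // Low surrogates occupy 0xDC00..0xDFFF.
    if (lo >= 0xDC00 && lo <= 0xDFFF) {
      return (hi - 0xD800) * 0x400 + (lo - 0xDC00) + 0x10000;
    }
  }
  return hi; // BMP character or lone surrogate: return the code unit as is
}

var tetragram = '\uD834\uDF06'; // the same string as '𝌆'
codePointAtIndex(tetragram, 0).toString(16); // '1d306'
```

Newer engines ship `String.prototype.codePointAt`, which does the same job; the manual version is shown only to make the arithmetic visible.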

I'm relatively comfortable with this stuff, but I am confused by your response.

First you say that engines will "internally" replace non-BMP glyphs with the replacement character, but then you give an example that seems to work fine (and I think would work fine as long as you don't cut that character in half, or try to inspect its character code without doing the proper incantations[1].)

So, I guess what I'm asking is, at what point does the string become "internal", such that the engine will replace the character with the replacement character?

[1]: As given in the article you linked to.
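
To make "cutting that character in half" concrete, here is a small sketch. Slicing between the two code units leaves a lone surrogate on each side, and concatenating the pieces restores the original string. Variable names are illustrative only.

```javascript
var s = 'a\uD834\uDF06b';  // 'a𝌆b': three visible characters, four code units
var left = s.slice(0, 2);  // 'a' followed by a lone high surrogate
var right = s.slice(2);    // a lone low surrogate followed by 'b'

left.charCodeAt(1).toString(16);  // 'd834'
right.charCodeAt(0).toString(16); // 'df06'
(left + right) === s;             // true: concatenation re-forms the pair
```

Nothing is lost inside the JS string itself; trouble is only likely once one of the halves is handed to something that expects well-formed UTF-16.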

I dare not try to re-explain the discussion in this bug report as my understanding feels insufficient, but the entire discussion at http://code.google.com/p/v8/issues/detail?id=761#c14 (note, I've linked to the 14th comment in the discussion, but there's more interesting stuff above) talks about it. At the core is a distinction between v8's internal representation of strings and its API vs. what a browser engine which embeds v8 might do.

Safari uses UTF-16, not UCS-2. I believe this is true of other browsers as well. Otherwise this would render the replacement char, but it doesn't, it renders correctly:

javascript:var x = '𝌆';document.write(x);

Well, a JS string is just a series of UTF-16 code units (per ES5, there is no impl choice here), so there isn't really any encoding per se (and it isn't necessarily a UTF-16 string, per the spec's definition thereof, as lone surrogates are valid). The fact that that works is more a testament to the DOM being UTF-16 than JS.
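
A rough sketch of the "lone surrogates are valid" point: the check below tells well-formed UTF-16 apart from a string that merely holds 16-bit code units. The function name `isWellFormedUtf16` is an assumption for illustration, not something from the spec or the thread.

```javascript
// Hypothetical helper: true when every surrogate in str belongs to a
// correctly ordered high/low pair, i.e. the string is well-formed UTF-16.
function isWellFormedUtf16(str) {
  for (var i = 0; i < str.length; i++) {
    var unit = str.charCodeAt(i);
    if (unit >= 0xD800 && unit <= 0xDBFF) {
      var next = i + 1 < str.length ? str.charCodeAt(i + 1) : 0;
      if (next < 0xDC00 || next > 0xDFFF) return false; // unpaired high
      i++; // skip the low half of a valid pair
    } else if (unit >= 0xDC00 && unit <= 0xDFFF) {
      return false; // low surrogate with no preceding high surrogate
    }
  }
  return true;
}

isWellFormedUtf16('\uD834\uDF06'); // true
isWellFormedUtf16('\uD834');       // false, yet still a legal JS string
isWellFormedUtf16('\uDF06\uD834'); // false: halves in the wrong order
```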

(On the other hand, I'm sure you knew that. But probably there are people reading your comment who didn't. :))

You are technically correct, the best kind of correct! But I think we both agree there is absolutely no sense in which anything in browser engines is UCS-2, and that browsers will not in fact replace characters beyond the BMP with the replacement glyph, as the top-level comment claimed. It is kind of embarrassing that the top-rated comment (as of writing) says completely false things.

I hate the implication in this comment and in the linked article that the spec is somehow immutable. The ECMAScript spec here is fundamentally flawed with regard to character encoding and needs to be fixed.

UCS-2 is not a valid Unicode encoding any more, because there are several sets of characters encoded outside the BMP. The spec should be updated to require UTF-16 support in all implementations.

If a modern programming language like JavaScript doesn't provide a way to represent characters outside the BMP in its character data type, that needs to be fixed too. Indexing and counting characters in a JavaScript string need to reflect the human and Unicode notion of characters, not the arbitrary 2-byte blocks that UCS-2 happens to use.
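
Until the language does this itself, counting by code point can be done in user code. A minimal sketch, assuming a made-up helper name `countCodePoints`; note that it counts Unicode code points, which is still not quite the same as what a human would call a character (combining marks, for instance, are counted separately).

```javascript
// Hypothetical helper: counts code points instead of 16-bit code units.
function countCodePoints(str) {
  var count = 0;
  for (var i = 0; i < str.length; i++) {
    var unit = str.charCodeAt(i);
    if (unit >= 0xD800 && unit <= 0xDBFF && i + 1 < str.length) {
      var next = str.charCodeAt(i + 1);
      if (next >= 0xDC00 && next <= 0xDFFF) i++; // a valid pair counts once
    }
    count++;
  }
  return count;
}

var text = 'a\uD834\uDF06b'; // 'a𝌆b'
text.length;                 // 4 code units
countCodePoints(text);       // 3 code points
```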

The language authors should be ashamed of this situation - having a modern language without proper Unicode support is simply awful.