by ender7
4948 days ago
Apropos: http://mathiasbynens.be/notes/javascript-encoding
TL;DR:
- JavaScript engines are free to internally represent strings as either UCS-2 or UTF-16. Engines that choose to go UCS-2 tend to replace all glyphs outside of the BMP with the replacement char (U+FFFD). Firefox, IE, Opera, and Safari all do this (with some inconsistencies).
- However, from the point of view of the actual JS code that gets executed, strings are always UCS-2 (sort of). In UTF-16, code points outside the BMP are encoded as surrogate pairs (4 bytes). But if you have a JavaScript string that contains such a character, it will be treated as two consecutive 2-byte characters.
var x = '𝌆';
x.length; // 2
x[0]; // \uD834
x[1]; // \uDF06
Note that if you insert said string into the DOM, it will still render correctly (you'll see a single character instead of two ?s).
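To make the code-unit vs. code-point distinction concrete, here is a minimal sketch that assumes an engine with ES2015 support (string iteration and `codePointAt`, both of which postdate this thread). `countCodePoints` is a made-up helper name, not something from the linked article.

```javascript
// Hypothetical helper: count code points instead of UTF-16 code units.
// Relies on the ES2015 string iterator, which yields a surrogate pair as one item.
function countCodePoints(str) {
  return Array.from(str).length;
}

var x = '\uD834\uDF06'; // the same string as '𝌆', written with escapes

console.log(x.length);                       // 2 (code units)
console.log(countCodePoints(x));             // 1 (code points)
console.log(x.charCodeAt(0).toString(16));   // d834 (first surrogate only)
console.log(x.codePointAt(0).toString(16));  // 1d306 (the full code point)
```

In older engines without these features, the same result has to be computed by hand from the two surrogates.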
First you say that engines will "internally" replace non-BMP glyphs with the replacement character, but then you give an example that seems to work fine (and I think would work fine as long as you don't cut that character in half, or try to inspect its character code without doing the proper incantations[1].)
So, I guess what I'm asking is, at what point does the string become "internal", such that the engine will replace the character with the replacement character?
[1]: As given in the article you linked to.
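For anyone skimming, the "incantations" amount to surrogate-pair arithmetic. What follows is a sketch of the standard UTF-16 decoding formula, not a copy of the article's code; `codePointAtIndex` is a hypothetical name, and the fallback for unpaired surrogates is my own assumption about what is reasonable.

```javascript
// Hypothetical helper: return the code point that starts at index i,
// combining a high/low surrogate pair by hand when one is present.
function codePointAtIndex(str, i) {
  var hi = str.charCodeAt(i);
  // High surrogates occupy 0xD800..0xDBFF.
  if (hi >= 0xD800 && hi <= 0xDBFF && i + 1 < str.length) {
    var lo = str.charCodeAt(i + 1);
    // Low surrogates occupy 0xDC00..0xDFFF.
    if (lo >= 0xDC00 && lo <= 0xDFFF) {
      return (hi - 0xD800) * 0x400 + (lo - 0xDC00) + 0x10000;
    }
  }
  // BMP character, or an unpaired surrogate returned as-is (assumption).
  return hi;
}

var x = '\uD834\uDF06'; // '𝌆'
console.log(codePointAtIndex(x, 0).toString(16)); // 1d306
```

Calling it with index 1 lands in the middle of the pair and just returns the low surrogate, which is the "cut that character in half" case.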