by zubi 2779 days ago
So did I. But I'd appreciate if someone could explain how it works.
4 comments

From the outset Unicode's goal (more so than ISO 10646 though now they're one and the same) was to unify all existing character sets, so you'd only need one.

Necessarily then, there should not be other sets that encode things you can't in Unicode, since then you can't displace those with Unicode.

So, particularly in the early life of Unicode, the goal was to collect stuff that already existed and add it to Unicode. (These days we're finished with that, and most new work is on adding things that weren't previously in any character set.)

Two controversial things were done, at opposite ends of the spectrum, during this period of consolidation:

What you're seeing here is the result of adding copies of the entire Latin alphabet, but with some particular property that Latin users would not really consider part of the character, such as "bold" or "italic", but which _was_ preserved in some character set being used somewhere. Without this choice, if we converted a text file encoded in a way that distinguished bold and italic characters, we'd lose that bold/italic distinction, and it might be significant. This would be like when you get a black & white photocopy of a sheet that says

"Ignore any text below shown in red"

Um, but none of this text is red? Oh. Probably some of it was before it was photocopied. Oops.

At the far end of the spectrum, a process called CJK unification took place, in which scholars of the languages using characters from the Han ("Chinese") writing system decided that although, say, a Japanese character set and a Chinese character set both had a particular character, and the Chinese and Japanese would not draw this character the same way, in some linguistic sense it's actually the same character (and in many cases the visual differences are quite small), and so Unicode should not encode both separately.

There's a coherent technical argument for why both these types of decisions made sense, but they were nonetheless controversial.

You should not use weird characters like italic Latin letters in new documents, but you also should not transform these characters without warning when processing an existing document as you may lose important meaning.

Thanks for the write-up.

Both had always bothered me deeply, but I'd never stopped to think that they're also essentially opposed in philosophy to each other. So now that I'm aware of that, I'm triply annoyed :S

One of the reasons for these sets is mathematics: ℜ <> ℝ in a math text. (And BTW the math symbols ℂℍℕℙℚℝ in the double-struck set are "out of sequence", which can be a nasty surprise if you do naive incrementation.)
And ℤ. The reason these double-struck symbols are in a weird place (U+2100-214f, separate from the rest in U+1d400-1d7ff) is because they all have commonly used special meanings in mathematics -- they're used to represent the sets of all numbers of various types. ℂ = complex numbers, ℍ = quaternions, ℕ = natural numbers, ℚ = rational numbers, ℝ = real numbers, ℤ = integers.
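To make the "naive incrementation" surprise concrete, here is a minimal Python sketch using only the standard `unicodedata` module. The helper name `double_struck` and the constant `BASE` are made up for illustration. It walks 26 code points starting at MATHEMATICAL DOUBLE-STRUCK CAPITAL A and reports which slots are unassigned, then looks the missing letters up by name instead.

```python
import unicodedata

BASE = 0x1D538  # MATHEMATICAL DOUBLE-STRUCK CAPITAL A


def double_struck(letter: str) -> str:
    """Return the double-struck form of an uppercase ASCII letter.

    Hypothetical helper: tries the Mathematical Alphanumeric Symbols
    block first, then falls back to the Letterlike Symbols block.
    """
    try:
        return unicodedata.lookup(f"MATHEMATICAL DOUBLE-STRUCK CAPITAL {letter}")
    except KeyError:
        return unicodedata.lookup(f"DOUBLE-STRUCK CAPITAL {letter}")


# Naive approach: assume the 26 capitals are contiguous from BASE.
# Slots with no character name are the holes in the sequence.
holes = "".join(
    chr(ord("A") + i)
    for i in range(26)
    if unicodedata.name(chr(BASE + i), None) is None
)

print("letters missing from the contiguous run:", holes)
for letter in holes:
    ch = double_struck(letter)
    print(f"{letter} -> U+{ord(ch):04X} {unicodedata.name(ch)}")
```

Looking characters up by name is slower than arithmetic on code points, but it does not depend on the block being contiguous.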
There are three slightly different things going on.

The first line, The quick brown fox, originates with East Asian character-based terminals, on which ideographic characters occupied twice the space of alphabetic characters, and there was also a desire to have Latin characters that were also double width. See https://en.wikipedia.org/wiki/Halfwidth_and_fullwidth_forms

The middle lines are included as mathematical symbols. The justification is that 𝑖 is a mathematical symbol that has its own independent meaning, which only coincidentally looks like italicized i. (I think this is silly, and naturally leads to a bloody mess as people misuse these symbols as letters, and in this case there is no backwards-compatibility excuse.) https://en.wikipedia.org/wiki/Mathematical_Alphanumeric_Symb...

The final line, like the first, is apparently present for compatibility with pre-Unicode east-Asian character sets. https://en.wikipedia.org/wiki/Enclosed_Alphanumerics
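One way to see that these really are three different things is to ask the Unicode character database how each kind of character relates to the plain letter. A small Python sketch with the standard `unicodedata` module follows. The sample characters are my own picks (one from each block), and `compat_tag` is a hypothetical helper name.

```python
import unicodedata

# One sample character from each of the three blocks discussed above.
samples = {
    "fullwidth": "\uff41",        # FULLWIDTH LATIN SMALL LETTER A
    "mathematical": "\U0001d456", # MATHEMATICAL ITALIC SMALL I
    "enclosed": "\u24d0",         # CIRCLED LATIN SMALL LETTER A
}


def compat_tag(ch: str) -> str:
    """Return the compatibility tag (e.g. '<wide>') of a character.

    Returns an empty string if the character has no tagged
    compatibility decomposition.
    """
    fields = unicodedata.decomposition(ch).split()
    return fields[0] if fields and fields[0].startswith("<") else ""


for label, ch in samples.items():
    # decomposition() returns the tag followed by the base code point(s)
    print(f"{label:13} U+{ord(ch):04X} {unicodedata.name(ch)}: "
          f"{unicodedata.decomposition(ch)}")
```

Each sample decomposes to an ordinary Latin letter, but under a different tag, which is the database's way of recording why the variant exists.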

For some reason unicode includes a few characters in a different "font".
Interesting. This is the first thing I thought, but when I fed "lazy" into google, it happily accepted it and displayed the results, so I thought there might be something else. But the text editors that I tried indeed don't match the characters when I search for them.
google is using unicode equivalence[1] to remap back to "standard" latin characters. this is important because, e.g., professional type-setting software may replace two adjacent "f" characters with a double-F ligature "ﬀ" depending on kerning. without unicode equivalence, google would fail on a lot of copy-paste queries.

[1] https://en.wikipedia.org/wiki/Unicode_equivalence
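A rough idea of what such a remapping could look like, sketched in Python with the standard `unicodedata` module. This is only an illustration of compatibility normalization (NFKC). I don't know what google actually runs, and `fold` and `loose_find` are hypothetical names.

```python
import unicodedata


def fold(text: str) -> str:
    """Apply NFKC compatibility normalization, then case-fold."""
    return unicodedata.normalize("NFKC", text).casefold()


def loose_find(haystack: str, needle: str) -> bool:
    """Substring search that ignores compatibility variants."""
    return fold(needle) in fold(haystack)


# "lazy" written with Mathematical Italic letters, built by name
# so the code points don't have to be typed by hand.
styled = "".join(
    unicodedata.lookup(f"MATHEMATICAL ITALIC SMALL {c}") for c in "LAZY"
)

print("plain search finds it:", "lazy" in styled)
print("folded search finds it:", loose_find(styled, "lazy"))
print("ligature U+FB00 folds to:", fold("\ufb00"))
```

A plain substring search behaves like the text editors mentioned above, while the folded search behaves more like the search engine. Note that NFKC is lossy, so it suits search keys better than rewriting the stored document.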

They are just different Unicode characters (separate code points) rendered in the same font.