Hacker News
by lmcarreiro 2779 days ago
I liked this one:

Ｔｈｅ ｑｕｉｃｋ ｂｒｏｗｎ ｆｏｘ ｊｕｍｐｓ ｏｖｅｒ ｔｈｅ ｌａｚｙ ｄｏｇ

𝐓𝐡𝐞 𝐪𝐮𝐢𝐜𝐤 𝐛𝐫𝐨𝐰𝐧 𝐟𝐨𝐱 𝐣𝐮𝐦𝐩𝐬 𝐨𝐯𝐞𝐫 𝐭𝐡𝐞 𝐥𝐚𝐳𝐲 𝐝𝐨𝐠

𝕿𝖍𝖊 𝖖𝖚𝖎𝖈𝖐 𝖇𝖗𝖔𝖜𝖓 𝖋𝖔𝖝 𝖏𝖚𝖒𝖕𝖘 𝖔𝖛𝖊𝖗 𝖙𝖍𝖊 𝖑𝖆𝖟𝖞 𝖉𝖔𝖌

𝑻𝒉𝒆 𝒒𝒖𝒊𝒄𝒌 𝒃𝒓𝒐𝒘𝒏 𝒇𝒐𝒙 𝒋𝒖𝒎𝒑𝒔 𝒐𝒗𝒆𝒓 𝒕𝒉𝒆 𝒍𝒂𝒛𝒚 𝒅𝒐𝒈

𝓣𝓱𝓮 𝓺𝓾𝓲𝓬𝓴 𝓫𝓻𝓸𝔀𝓷 𝓯𝓸𝔁 𝓳𝓾𝓶𝓹𝓼 𝓸𝓿𝓮𝓻 𝓽𝓱𝓮 𝓵𝓪𝔃𝔂 𝓭𝓸𝓰

𝕋𝕙𝕖 𝕢𝕦𝕚𝕔𝕜 𝕓𝕣𝕠𝕨𝕟 𝕗𝕠𝕩 𝕛𝕦𝕞𝕡𝕤 𝕠𝕧𝕖𝕣 𝕥𝕙𝕖 𝕝𝕒𝕫𝕪 𝕕𝕠𝕘

𝚃𝚑𝚎 𝚚𝚞𝚒𝚌𝚔 𝚋𝚛𝚘𝚠𝚗 𝚏𝚘𝚡 𝚓𝚞𝚖𝚙𝚜 𝚘𝚟𝚎𝚛 𝚝𝚑𝚎 𝚕𝚊𝚣𝚢 𝚍𝚘𝚐

⒯⒣⒠ ⒬⒰⒤⒞⒦ ⒝⒭⒪⒲⒩ ⒡⒪⒳ ⒥⒰⒨⒫⒮ ⒪⒱⒠⒭ ⒯⒣⒠ ⒧⒜⒵⒴ ⒟⒪⒢

2 comments

Google understands almost all of them (it misses only the last one), and the first result is https://en.wikipedia.org/wiki/The_quick_brown_fox_jumps_over...

Bing doesn't understand any of them: if you search for one, the first result is the GitHub repo with these phrases.

So did I. But I'd appreciate it if someone could explain how it works.

From the outset Unicode's goal (more so than ISO 10646 though now they're one and the same) was to unify all existing character sets, so you'd only need one.

Necessarily then, there should not be other sets that encode things you can't in Unicode, since then you can't displace those with Unicode.

So, particularly in the early life of Unicode, the goal was to collect stuff that already existed and add it to Unicode. (These days we're finished with that, and most new work is on adding things that weren't previously in any character set.)

Two controversial things were done, at opposite ends of the spectrum, during this period of consolidation:

What you're seeing here is the first of these: adding copies of the entire Latin alphabet, each with some particular property that Latin users would not really consider part of the character, such as "bold" or "italic", but which _was_ preserved in some character set being used somewhere. Without this choice, if we converted a text file encoded in a way that distinguished bold and italic characters, we'd lose that bold/italic distinction, and it might be significant. This would be like when you get a black & white photocopy of a sheet that says

"Ignore any text below shown in red"

Um, but none of this text is red? Oh. Probably some of it was before it was photocopied. Oops.

At the far end of the spectrum, a process called CJK unification took place. Scholars of the languages that use characters from the Han ("Chinese") writing system decided that even when, say, a Japanese character set and a Chinese character set both had a particular character, and Chinese and Japanese writers would not draw it the same way, it is in some linguistic sense the same character (and in many cases the visual differences are quite small), so Unicode should not encode both separately.

There's a coherent technical argument for why both these types of decisions made sense, but they were nonetheless controversial.

You should not use weird characters like italic Latin letters in new documents, but you also should not transform these characters without warning when processing an existing document as you may lose important meaning.
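The "don't transform without warning" advice can be sketched as a check that reports such characters instead of rewriting them. This is only an illustration built on Python's standard `unicodedata` module: the helper name `report_compat_chars` is made up here, and "changes under NFKC" is just one possible definition of a "weird" character (it also flags things other than styled Latin letters).

```python
import unicodedata


def report_compat_chars(text):
    """Return (index, char, name, nfkc_form) for each character NFKC would change.

    Hypothetical helper: it only reports, it never modifies `text`.
    """
    findings = []
    for index, char in enumerate(text):
        folded = unicodedata.normalize("NFKC", char)
        if folded != char:
            name = unicodedata.name(char, "<unnamed>")
            findings.append((index, char, name, folded))
    return findings


if __name__ == "__main__":
    # "it" written with MATHEMATICAL BOLD ITALIC SMALL I and T.
    sample = "Ignore \U0001D48A\U0001D495 below"
    for index, char, name, folded in report_compat_chars(sample):
        print(f"{index}: {name} would become {folded!r}")
```

A tool could show this report to the user and only normalize after confirmation, so the "red text" is not lost silently.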

Thanks for the write-up.

Both had always bothered me deeply, but I'd never stopped to think that they're also essentially opposed in philosophy to each other. So now that I'm aware of that, I'm triply annoyed :S

One of the reasons for these sets is mathematics: ℜ and ℝ mean different things in a math text. (BTW, the math symbols ℂℍℕℙℚℝ in the double-struck set are "out of sequence", which can be a nasty surprise if you do naive incrementation.)

And ℤ. The reason these double-struck symbols are in a weird place (U+2100-214F, separate from the rest in U+1D400-1D7FF) is that they all have commonly used special meanings in mathematics -- they're used to represent the sets of all numbers of various types. ℂ = complex numbers, ℍ = quaternions, ℕ = natural numbers, ℚ = rational numbers, ℝ = real numbers, ℤ = integers.
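To make the "naive incrementation" surprise concrete, here is a small Python sketch. It assumes the double-struck capitals start at U+1D538 (MATHEMATICAL DOUBLE-STRUCK CAPITAL A) and relies on the Unicode database bundled with the interpreter; the function names are invented for this example.

```python
import unicodedata

DOUBLE_STRUCK_CAPITAL_A = 0x1D538  # MATHEMATICAL DOUBLE-STRUCK CAPITAL A


def naive_double_struck(letter):
    """Naive mapping: offset an ASCII capital from double-struck A."""
    return chr(DOUBLE_STRUCK_CAPITAL_A + ord(letter) - ord("A"))


def holes():
    """ASCII capitals whose naive double-struck code point has no name (unassigned)."""
    return [
        letter
        for letter in "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        if unicodedata.name(naive_double_struck(letter), None) is None
    ]


def double_struck(letter):
    """Look the character up by name instead of by arithmetic."""
    candidates = (
        f"MATHEMATICAL DOUBLE-STRUCK CAPITAL {letter}",  # U+1D538 block
        f"DOUBLE-STRUCK CAPITAL {letter}",  # letterlike symbols such as U+211D
    )
    for name in candidates:
        try:
            return unicodedata.lookup(name)
        except KeyError:
            continue
    raise ValueError(f"no double-struck capital for {letter!r}")


if __name__ == "__main__":
    print("holes in the naive range:", "".join(holes()))
    print("R by name: U+%04X" % ord(double_struck("R")))
```

Looking characters up by name sidesteps the gaps, at the cost of depending on the character names rather than on code point arithmetic.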
There are three slightly different things going on.

The first line, The quick brown fox, originates with East Asian character-based terminals, on which ideographic characters occupied twice the width of alphabetic characters, and there was also a desire to have Latin characters that were double width too. See https://en.wikipedia.org/wiki/Halfwidth_and_fullwidth_forms

The middle lines are included as mathematical symbols. The justification is that 𝑖 is a mathematical symbol with its own independent meaning, which only coincidentally looks like an italicized i. (I think this is silly, and naturally leads to a bloody mess as people misuse these symbols as letters, and in this case there is no backwards-compatibility excuse.) https://en.wikipedia.org/wiki/Mathematical_Alphanumeric_Symb...

The final line, like the first, is apparently present for compatibility with pre-Unicode East Asian character sets. https://en.wikipedia.org/wiki/Enclosed_Alphanumerics
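The three cases can be told apart programmatically. The sketch below, using only Python's standard `unicodedata` module, shows the fixed offset between ASCII and the fullwidth forms and prints the compatibility tag each kind of character carries in the Unicode database. The helper names are invented for the example, and the tag describes the kind of mapping, not the historical reason the character was added.

```python
import unicodedata

FULLWIDTH_OFFSET = 0xFEE0  # U+FF01..U+FF5E mirror ASCII U+0021..U+007E


def to_fullwidth(text):
    """Shift printable ASCII (except space) into the fullwidth block."""
    return "".join(
        chr(ord(c) + FULLWIDTH_OFFSET) if "!" <= c <= "~" else c
        for c in text
    )


def compat_tag(char):
    """Return the tag of a compatibility decomposition, e.g. '<wide>', or ''."""
    decomposition = unicodedata.decomposition(char)
    return decomposition.split()[0] if decomposition.startswith("<") else ""


if __name__ == "__main__":
    samples = {
        "fullwidth t": to_fullwidth("t"),
        "math bold t": "\U0001D42D",
        "parenthesized t": "\u24AF",
    }
    for label, char in samples.items():
        folded = unicodedata.normalize("NFKC", char)
        print(f"{label}: tag={compat_tag(char)} nfkc={folded!r}")
```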

For some reason Unicode includes a few characters in a different "font".

Interesting. This is the first thing I thought, but when I fed "lazy" into Google, it happily accepted it and displayed the results, so I thought there might be something else. But the text editors that I tried indeed don't match the characters when I search for them.

google is using unicode equivalence[1] to remap back to "standard" latin characters. this is important because, e.g., professional type-setting software may replace two adjacent "f" characters with a double-F ligature "ﬀ" depending on kerning. without unicode equivalence, google would fail on a lot of copy-paste queries.

[1] https://en.wikipedia.org/wiki/Unicode_equivalence
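For anyone who wants to try this, a rough sketch of equivalence-based matching in Python follows. It uses NFKC normalization plus case folding from the standard library; this is an assumption about one reasonable way to do it, not a description of Google's actual pipeline, and `fold` and `loose_contains` are names made up for the example.

```python
import unicodedata


def fold(text):
    """Compatibility-normalize (NFKC) and case-fold text for loose matching."""
    return unicodedata.normalize("NFKC", text).casefold()


def loose_contains(haystack, needle):
    """True if `needle` occurs in `haystack` after folding both sides."""
    return fold(needle) in fold(haystack)


if __name__ == "__main__":
    bold_lazy = "\U0001D425\U0001D41A\U0001D433\U0001D432"  # mathematical bold "lazy"
    ligature_word = "o\ufb00ice"  # "office" typed with the ff ligature U+FB00
    parenthesized_lazy = "\u24A7\u249C\u24B5\u24B4"

    print(loose_contains("The quick brown fox jumps over the lazy dog", bold_lazy))
    print(fold(ligature_word))
    print(fold(parenthesized_lazy))
```

Note that the parenthesized letters fold to "(l)(a)(z)(y)" rather than "lazy", so a plain NFKC fold would also miss the last line; whether that is why Google misses it is only a guess.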

They are just different Unicode characters rendered in the same font.