| HN Mirror



	by ncmncm 2387 days ago
	There is way more than enough wacky stuff introduced by Unicode. Having dozens of letters A, for example. And giving a Japanese Kanji character the same code as a Chinese one that usually looks similar.

4 comments

jrochkind1 2387 days ago

Unicode did not introduce having dozens of letters A, they existed without unicode. Unicode just gives you a way to represent them -- and bonus, often to normalize them all to a normal letter A too.
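To make the "normalize them all to a normal letter A" point concrete, here is a minimal Python sketch using the standard `unicodedata` module. The four variants are picked for illustration, not an exhaustive list, and NFKC is assumed to be the normalization form meant here.

```python
import unicodedata

# A handful of the many code points that are "some kind of A".
variants = [
    "\U0001D504",  # MATHEMATICAL FRAKTUR CAPITAL A
    "\uFF21",      # FULLWIDTH LATIN CAPITAL LETTER A
    "\u24B6",      # CIRCLED LATIN CAPITAL LETTER A
    "\U0001D400",  # MATHEMATICAL BOLD CAPITAL A
]

# Compatibility normalization (NFKC) folds each variant to plain U+0041.
normalized = [unicodedata.normalize("NFKC", ch) for ch in variants]

for ch, folded in zip(variants, normalized):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} -> {folded!r}")

# Canonical normalization (NFC) leaves them distinct, so a document
# that needs both "𝔄" and "A" can still keep them apart.
kept = unicodedata.normalize("NFC", "\U0001D504")
```

So the distinction is available when you want it, and one call removes it when you don't.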

It is a mistake to think that Unicode has the ability to change people's text behavior by not supporting things. I mean, maybe it does now that it's so popular, but in order to get adoption it had to support what people were actually doing.

People had use cases to put "𝔄" and "A" in the same document and keep them distinct without unicode. It is not a service for unicode to refuse to let them; and if it tried, it would just lead to people finding a way to do it in a non-standardized or non-unicode-standardized way anyway, which still wouldn't help anyone.

You might just as well say "I don't understand why we need lowercase AND capital letters, the standard is just complicating things supporting both -- the ancient romans didn't need two cases after all"

tomp 2386 days ago

There's no way to distinguish "A" "uppercase a" and "Α" "uppercase α" in written text, but they're different Unicode letters (and might be rendered differently depending on font).
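A small Python sketch of that distinction, using only the standard `unicodedata` module. It also includes the Cyrillic capital A for comparison, which is an addition for illustration and not part of the comment above.

```python
import unicodedata

# Three capitals that most fonts draw identically.
lookalikes = {
    "latin": "\u0041",     # A
    "greek": "\u0391",     # Α
    "cyrillic": "\u0410",  # А
}

names = {script: unicodedata.name(ch) for script, ch in lookalikes.items()}
for script, ch in lookalikes.items():
    print(f"{script:9} U+{ord(ch):04X} {names[script]}")

# They behave as different letters: each lowercases within its own script.
lowered = {script: ch.lower() for script, ch in lookalikes.items()}

# Even compatibility normalization does not merge them.
still_greek = unicodedata.normalize("NFKC", lookalikes["greek"])
```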

msla 2387 days ago

> There is way more than enough wacky stuff introduced by Unicode. Having dozens of letters A, for example.

You'd have to go back before Unicode to prevent this.

Unicode was created with certain engineering constraints, one of them being round-trip compatibility. This means that it needs to be possible to go from $OTHER_ENCODING -> Unicode -> $OTHER_ENCODING and get a result which is bitwise-identical to the input. In short, Unicode is saddled with the fact that pre-Unicode text encoding was a mess, plus the fact that people tend not to like irreversible format changes.
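As a rough illustration of the round-trip constraint, the Python sketch below uses Shift JIS as the $OTHER_ENCODING (my choice of example, not something named above). Shift JIS already had both a half-width and a full-width "A", so Unicode needs two code points to get the original bytes back.

```python
import unicodedata

# Half-width "A" (0x41) followed by full-width "A" (0x82 0x60) in Shift JIS.
legacy = b"A\x82\x60"

# $OTHER_ENCODING -> Unicode -> $OTHER_ENCODING
text = legacy.decode("shift_jis")
round_tripped = text.encode("shift_jis")

# What would happen if Unicode had only one "A": fold the full-width
# form into the plain one, then try to go back.
merged = unicodedata.normalize("NFKC", text)
lossy = merged.encode("shift_jis")

print([f"U+{ord(c):04X}" for c in text])
print(round_tripped == legacy, lossy == legacy)
```

With two code points the bytes survive unchanged; with the letters merged, the original 3-byte input comes back as 2 bytes and the distinction is gone.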

sansnomme 2387 days ago

Yeah, that is quite inconsistent. Kanji literally means "Chinese Character" so it should be the same for the letter A. Unless a French A isn't equivalent to an English A.

jcranmer 2387 days ago

Arabic numerals (0123456789) are not to be confused with the Arabic numerals (٠١٢٣٤٥٦٧٨٩). So the fact that kanji literally means "Chinese character" doesn't mean that kanji and hanzi should be considered the same script.
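For the two sets of digits, a quick Python sketch (standard library only) shows they are separate code points that carry the same numeric values. The variable names are just for illustration.

```python
import unicodedata

western = "0123456789"
# U+0660..U+0669, the digits written ٠ through ٩.
arabic_indic = "".join(chr(0x0660 + i) for i in range(10))

# Same numeric value per position, different characters.
values = [unicodedata.decimal(ch) for ch in arabic_indic]
three_name = unicodedata.name(arabic_indic[3])

# Python's int() accepts any Unicode decimal digits.
as_int_western = int(western)
as_int_arabic = int(arabic_indic)

print(three_name, values, as_int_arabic)
```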

The Latin script (that which I write right now) and the Cyrillic script both derived heavily from the Greek script, especially the capital letters--fully 60% of them are identical in Latin and Greek, even more if you include obsolete letters like digamma and lunate sigma (roughly F and C, respectively). Most of these homoglyphs furthermore share identical phonetic values.

In retrospect, treating traditional Chinese, simplified Chinese, and Japanese kanji as different scripts seems like it would have been the better path. I don't know enough about the Korean and Vietnamese usage of Chinese characters to know if those scripts are themselves independent daughter scripts or complete imports of Chinese with a few extra things thrown in (consider Farsi's additions to Arabic, or Icelandic's þ and ð additions to Latin).

tasogare 2386 days ago

> So the fact that kanji literally means "Chinese character" doesn't mean that kanji and hanzi should be considered the same script.

The Japanese writing system differs from the Chinese one by having its own distinct scripts (hiragana, katakana), but most of its subset made of Chinese characters (kanji) is the same as the Chinese script. The most comprehensive Chinese character dictionary is a Japanese one (Daikanwa jiten), which gives definitions and Japanese readings, and this is possible precisely because the script is the same.

The only differences are characters created for use in Japan (kokuji), which can be treated as an extension like the Vietnamese Nôm, characters simplified by the Japanese government (some jôyô kanji), and variations in the shape of some glyphs (黃/黄). So, treating the full inventory of these languages as different scripts wouldn't make more sense than encoding the English, French and Czech alphabets separately because a few characters differ.

My opinion is that Han unification makes sense, but the mistake made was to leave the variant interpretation to the application level (e.g. html lang tags), which is not portable. I don't know how Unicode variant forms work in detail (putting a trailing code after a character to indicate precisely which variant is meant) but something like that at the text encoding level could ease a lot of pain.
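For what it's worth, the "trailing code" mechanism looks roughly like this at the code point level. This is a Python sketch only: the base character 葛 and the selector U+E0100 are chosen as an example, whether a font actually draws a different glyph for this pair is an assumption, and `strip_variation_selectors` is a hypothetical helper, not a library function.

```python
import unicodedata

base = "\u845B"        # 葛, a unified Han ideograph
vs17 = "\U000E0100"    # VARIATION SELECTOR-17
sequence = base + vs17  # base character followed by the trailing selector

selector_name = unicodedata.name(vs17)

def strip_variation_selectors(s: str) -> str:
    """Hypothetical helper: drop U+FE00..U+FE0F and U+E0100..U+E01EF."""
    return "".join(
        ch for ch in s
        if not (0xFE00 <= ord(ch) <= 0xFE0F or 0xE0100 <= ord(ch) <= 0xE01EF)
    )

# The selector travels with the text and survives normalization, but
# software that ignores it still sees the plain base character.
survives_nfc = unicodedata.normalize("NFC", sequence) == sequence
plain = strip_variation_selectors(sequence)

print(len(sequence), selector_name, survives_nfc, plain == base)
```

The variant choice lives in the text itself here, so it does not depend on a lang tag in the surrounding markup.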

mantap 2387 days ago

The alternative is worse. Look at all of the problems we have with Turkish I, just because they didn't create new codepoints to make Turkish I and Latin I distinct even though they look the same.
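A minimal Python sketch of the Turkish I problem. Python's built-in case mapping is locale-independent, so the `turkish_lower` and `turkish_upper` helpers below are hypothetical workarounds written for this example, not standard library functions.

```python
DOTTED_CAPITAL_I = "\u0130"   # İ
DOTLESS_SMALL_I = "\u0131"    # ı

# Default mappings pair I with i, which is wrong for Turkish text.
default_lower = "I".lower()
default_upper = "i".upper()
# İ lowercases to "i" plus a combining dot above (two code points).
dotted_lower = DOTTED_CAPITAL_I.lower()

def turkish_lower(s: str) -> str:
    """Hypothetical helper: map İ to i and I to ı before lowercasing."""
    return s.replace(DOTTED_CAPITAL_I, "i").replace("I", DOTLESS_SMALL_I).lower()

def turkish_upper(s: str) -> str:
    """Hypothetical helper: map i to İ and ı to I before uppercasing."""
    return s.replace("i", DOTTED_CAPITAL_I).replace(DOTLESS_SMALL_I, "I").upper()

print(repr(default_lower), repr(dotted_lower))
print(turkish_lower("ILIK"), turkish_upper("istanbul"))
```

Because the same code points serve both alphabets, the correct case mapping depends on knowing the language, which the text alone does not tell you.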

Shorel 2387 days ago

Cyrillic А is not the same as English A.

For example, some fonts render A in a way that looks like Cyrillic Л. (Like The Mandalorian title screen.)

This would be incorrect if using the same A for both: https://i.ytimg.com/vi/V8fC7bdV-mI/maxresdefault.jpg

bonoboTP 2387 days ago

I don't see anything special in the linked image. The As look as they would in Latin script.

(probably not what you meant, but just in case: the fourth letter is not a Cyrillic A but a D.)

GrantSolar 2386 days ago

I think the image is not meant to show the problem itself but to show a case where, if the Cyrillic A had been stylised the same way that the English A is in the English version, the two distinct letters would become indistinguishable, such that the Cyrillic title would effectively read "The Mlndlloriln".

bonoboTP 2386 days ago

I see, for those out of the loop, the English title screen does not have the horizontal bar in the As.

Shorel 2385 days ago

I should have linked the English title as well; rereading my comment, I see it was ambiguous.

Here: https://en.wikipedia.org/wiki/File:The_Mandalorian_logo.jpg you can see that the A letters are rendered the same as Cyrillic Л.

To further the confusion, in Buenos Aires there are many street signs that use Л as the letter A, and they use П as the letter N.

bonoboTP 2385 days ago

Part of the confusion was because I see the English title's As are actually Λ (capital Greek lambda), rather than Л (at least in the font that HN uses). I'm guessing from context that Л is sometimes rendered as Λ in some Cyrillic fonts.

Shorel 2384 days ago

> I'm guessing from context that Л is sometimes rendered as Λ in some Cyrillic fonts.

Exactly. It is more often rendered like that in Bulgaria. But it is still the letter Л.

Which just furthers the point that glyph rendering and character code points are very different problems and the multiple code points in Unicode are the right approach.

tomp 2386 days ago

If the readers were to confuse A and Д, that's a problem with the font, not the letters. Cyrillic, Greek and Latin A are all one letter (in uppercase).

mehrdadn 2387 days ago

That might be wacky to you but I'm not sure it's wacky to the people to whom it makes a difference.

ncmncm 2387 days ago

Lots of the wacky stuff came from the original dream of a purely 16-bit code, and then more wacky stuff to extend it from there. I.e., starting from UTF-8 could have avoided any amount of unpleasantness. But of course UTF-8 wasn't invented until later. The 16-bit representation got encrusted in OSes and languages of a certain period.
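To illustrate one piece of that 16-bit legacy, here is a short Python sketch of how a character outside the original 16-bit range gets encoded. The character 𝔄 (U+1D504) is used as the example, and the surrogate arithmetic is spelled out by hand for comparison with the built-in codec.

```python
ch = "\U0001D504"  # 𝔄, beyond the 16-bit range

utf8 = ch.encode("utf-8")
utf16 = ch.encode("utf-16-be")

# UTF-16 has to split the code point into a surrogate pair.
offset = ord(ch) - 0x10000
high = 0xD800 + (offset >> 10)    # high (lead) surrogate
low = 0xDC00 + (offset & 0x3FF)   # low (trail) surrogate
manual = high.to_bytes(2, "big") + low.to_bytes(2, "big")

# Plain ASCII stays one byte in UTF-8 but takes two in UTF-16.
ascii_utf8 = "A".encode("utf-8")
ascii_utf16 = "A".encode("utf-16-be")

print(utf8.hex(), utf16.hex(), f"{high:04X} {low:04X}")
```

Code written on the assumption that one 16-bit unit equals one character breaks on the surrogate pair, which is the kind of unpleasantness a UTF-8 starting point would have sidestepped.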

The same goes, of course, for writing systems, going back to the first, that we would all do differently in hindsight.

Even today we are making apparently sensible choices we (or our digital successors) will regret as deeply.

ISO 8601 looks good now, but it only delays the transition to a rational calendar which, admittedly, we would certainly get wrong if we tried codifying one today.

Fortunately daylight saving time will be gone worldwide before the next decade passes, but not without some places getting stuck at the wrong timezone. (E.g. Portugal different from Spain, and probably Indiana different from itself.)

kevin_thibedeau 2387 days ago

Indiana doesn't have special time zones any more. It allows individual counties to choose which standard zone to be in but they all observe the normal DST.