

	by a3n 3498 days ago
	This is strange to me. In Unicode, this is clearly meant to be the 'G' that we all know and love. It has uselessly expanded "the alphabet" (to be Western-centric) in a confusable way. Unicode maybe should have been three dimensional, with "concept of G" in the 2D space, and "ways of representing G" behind G, along the third axis. All ways of representing G, whether little capital, capital, lower case, would or at least could equate to conceptual G in the 2D space.


hackuser 3498 days ago

> Unicode maybe should have been three dimensional, with "concept of G" in the 2D space, and "ways of representing G" behind G, along the third axis. All ways of representing G, whether little capital, capital, lower case, would or at least could equate to conceptual G in the 2D space.

It brings up interesting, long-standing problems. Which of these count as the same letters?

* Letters in two languages with the same appearance and making the same phonetic sound

* Letters in two languages with the same appearance but making slightly different phonetic sounds. E.g., R in English and French

* Letters in two languages that are otherwise the same, but one has an accent. Is the accent part of the letter? Separate? Are they really the same letter?

* Letters in two languages with the same appearance but making completely different phonetic sounds.

* Similar (by any property) letters in two related languages; e.g., both Indo-European

* Similar (by any property) letters in two unrelated languages; e.g., French and Vietnamese.

* Letters with the same phonetic sound but different appearances.

* Letters with the same appearance, one is phonetic and one an ideograph

* Letters that are otherwise identical, but alphabetize differently in their respective languages

* EDIT: Forgot a key one: letters that are otherwise identical, but follow different rules for how they combine with the letters around them (a common issue, though not one familiar to English speakers).

* Letters that are in all ways identical but belong to different languages. In which language's code group does the letter belong? One? Both? What if the subset of Unicode supported by an application includes one language but not the other?

etc. etc.
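
The "same appearance, different language" cases in the list above can be made concrete with a small sketch. It uses Python's standard `unicodedata` module; the three letters chosen (Latin A, Greek Alpha, Cyrillic A) are an illustrative pick, not something from the thread.

```python
import unicodedata

# Three capital letters that usually render identically but are
# encoded separately because they belong to different scripts.
lookalikes = ["\u0041", "\u0391", "\u0410"]  # Latin, Greek, Cyrillic

for ch in lookalikes:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")

# Even compatibility normalization keeps all three distinct.
normalized = {unicodedata.normalize("NFKC", ch) for ch in lookalikes}
print(len(normalized))  # 3
```

So for this case at least, Unicode answers "not the same letter": identity follows the script, not the glyph.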


vurpo 3498 days ago

> Letters in two languages that are otherwise the same, but one has an accent. Is the accent part of the letter? Separate? Are they really the same letter?

It gets worse than this. For example, the letters Ä and Ö exist in both Swedish and German.

In German they are actually counted as the letters A and O with diaereses above them, and they alphabetize together with other instances of the letters A and O, because that's what they are.

In Swedish those are their own letters, which are completely separate from the letters A and O. They get their own place in the alphabet (second-to-last and last, respectively), and replacing them with AE and OE is technically not acceptable in Swedish like it is in German (though it's often done anyway, e.g. on airline tickets).

And in Unicode they are represented by the same code-point even though in one language it is a letter, and in the other language it's only a variation on another letter. What a mess.
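
A rough sketch of what that means for sorting, assuming two hand-written sort keys: `swedish_key` and `german_key` are hypothetical helpers that only mimic the two orderings described above, and the word list is made up. Real applications would more likely use a locale-aware collation library.

```python
import unicodedata

words = ["Zebra", "\u00c4rger", "Apfel", "\u00d6l", "Ofen"]  # Ärger, Öl

# Assumed Swedish alphabet order: å, ä, ö come after z.
SWEDISH = "abcdefghijklmnopqrstuvwxyz\u00e5\u00e4\u00f6"

def swedish_key(word):
    # Rank each letter by its position in the Swedish alphabet.
    composed = unicodedata.normalize("NFC", word).lower()
    return [SWEDISH.index(c) for c in composed]

def german_key(word):
    # Decompose and drop combining marks, so ä sorts as a and ö as o.
    decomposed = unicodedata.normalize("NFD", word.lower())
    return "".join(c for c in decomposed if not unicodedata.combining(c))

print(sorted(words, key=swedish_key))  # ['Apfel', 'Ofen', 'Zebra', 'Ärger', 'Öl']
print(sorted(words, key=german_key))   # ['Apfel', 'Ärger', 'Ofen', 'Öl', 'Zebra']
```

The code points are identical in both runs; only the collation rule differs, which is why the ordering has to live outside the character encoding.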


jahewson 3498 days ago

That character is from the phonetic alphabet, so it's not the "concept of G"; it's the concept of a "voiced uvular stop", which happens to look like G. So what Unicode is doing is separating two conceptually different ideas, exactly as intended.

The cases where Unicode has taken similar-looking characters and combined them into one have not been successful. Han unification, for example, was widely viewed as a misstep and has caused many problems, such as making it impossible to embed certain Japanese characters in Chinese text without higher-level markup.


stevenbedrick 3498 days ago

It actually does do something along those lines, with the "canonical" and "compatibility" equivalence rules:

https://en.wikipedia.org/wiki/Unicode_equivalence

As mentioned by others on this thread, the real issue is not with Unicode per se, but rather with the ways that web browsers handle it (or fail to handle it, as the case may be).
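
For readers who haven't met the two kinds of equivalence, here is a minimal sketch using Python's standard `unicodedata.normalize`. The sample characters (Ä and the "fi" ligature) are chosen for illustration and are not from the comment.

```python
import unicodedata

precomposed = "\u00c4"   # Ä as a single code point
decomposed = "A\u0308"   # A followed by COMBINING DIAERESIS

# Canonical equivalence: different code point sequences, same character.
print(precomposed == decomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True

# Compatibility equivalence is looser and folds presentation variants.
ligature = "\ufb01"      # LATIN SMALL LIGATURE FI
print(unicodedata.normalize("NFC", ligature) == ligature)       # True
print(unicodedata.normalize("NFKC", ligature))                  # fi
```

The raw strings compare unequal until they are normalized, so any software that compares text has to opt in to one of these forms.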


zokier 3498 days ago

I think it is very much an issue in Unicode that they did not define the NFKD of ɢ to be G. As far as I can tell, the rationale is that ɢ is semantically different because it is used in IPA. I find that pretty weak, considering the ubiquity of smallcaps. Asking browsers to diverge (as far as equivalence goes) from Unicode standards sounds a lot like a failure of Unicode.
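
The point about NFKD can be checked directly. This sketch uses Python's standard `unicodedata` module and adds a fullwidth G (U+FF27) purely for contrast; that second character is an editorial choice, not something the comment mentions.

```python
import unicodedata

small_cap_g = "\u0262"   # ɢ, the IPA letter discussed in this thread
fullwidth_g = "\uff27"   # Ｇ, a fullwidth variant shown for contrast

for ch in (small_cap_g, fullwidth_g):
    folded = unicodedata.normalize("NFKD", ch)
    print(unicodedata.name(ch), "->", unicodedata.name(folded))

# LATIN LETTER SMALL CAPITAL G -> LATIN LETTER SMALL CAPITAL G
# FULLWIDTH LATIN CAPITAL LETTER G -> LATIN CAPITAL LETTER G
```

The fullwidth form folds to plain G under compatibility normalization, while ɢ is left untouched, so normalizing input does not by itself catch this look-alike.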


spullara 3498 days ago

The web browser or DNS?


drewmate 3498 days ago

That's a really interesting proposal, but I'm afraid it would be difficult to implement in practice. If this third dimension were actually encoded into the number that represents each character, you'd end up with a lot of wasted bits, since most characters probably wouldn't need the third dimension at all, or at least not as much of it as the heaviest users. Another option would be to supplement the metadata that already accompanies Unicode characters (which block it is in, the name of the character or block, etc.). This could be done in practice now, but the information would almost certainly be ignored if it had to be looked up in a supplemental table. Furthermore, it's difficult to agree on just about anything in Unicode, and classifying all the characters by concept seems like a Herculean task for a slow-moving body.

Any ideas for how to accomplish this in practice?
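
Purely as a sketch of the supplemental-table option: the table `CONCEPT_OF` and the helper `concept_key` below are hypothetical, and the three entries are illustrative assumptions rather than real Unicode data.

```python
# Hypothetical supplemental table: variant character -> "concept" character.
# The entries are illustrative assumptions, not Unicode data.
CONCEPT_OF = {
    "\u0067": "G",   # g, lowercase
    "\u0262": "G",   # ɢ, small capital (IPA)
    "\uff27": "G",   # Ｇ, fullwidth
}

def concept_key(text):
    # Map each character to its concept; unknown characters map to themselves.
    return "".join(CONCEPT_OF.get(ch, ch) for ch in text)

print(concept_key("\u0262oogle") == concept_key("Google"))  # True
```

This keeps code points the size they are today, but it only helps software that chooses to consult the table, which is the weakness already noted above.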


a3n 3498 days ago

I'll get to that as soon as I make email secure by design.


jahewson 3498 days ago

This already exists in Unicode: it's called "Variation Selectors". They have their own block and are used to select between the text and emoji presentation of a character, amongst other things. (Emoji skin tones use separate modifier characters.)

But it would be wrong to use them in this case because an IPA G and the letter G are semantically different things and should not be unified into a single character just because they look similar.
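
Mechanically, a variation selector is an extra code point that follows a base character and requests a particular rendering of it. A minimal sketch with Python's standard `unicodedata` module, using the heart symbol U+2764 as an arbitrary example base:

```python
import unicodedata

heart = "\u2764"                # HEAVY BLACK HEART
text_style = heart + "\ufe0e"   # VARIATION SELECTOR-15 requests text presentation
emoji_style = heart + "\ufe0f"  # VARIATION SELECTOR-16 requests emoji presentation

print(unicodedata.name("\ufe0f"))   # VARIATION SELECTOR-16
print(len(emoji_style))             # 2
print(text_style == emoji_style)    # False

# Normalization keeps the selector; it is part of the encoded text.
print(unicodedata.normalize("NFKC", emoji_style) == emoji_style)  # True
```

The base character stays the same in both strings, which is why this mechanism fits glyph variants of one character and not two characters with different meanings.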


Lagged2Death 3498 days ago

The G is part of a block called "IPA Extensions." Most of its content is more obviously specialized. This G is a phonetic G.

It's not necessarily the case that any given symbol has a bunch of different Unicode representations; unfortunately G has at least two, though.

https://en.m.wikipedia.org/wiki/IPA_Extensions

http://www.fileformat.info/info/unicode/block/ipa_extensions...
