| HN Mirror



	by p_l 703 days ago
	There's a semantic difference between "accented letter" and "different letter that happens to visually look like another language's accented letter". "Ą" in Polish is not "A" with some accent. And the idea behind Unicode was to preserve human written text, including keeping track of things like "this is letter A1 with an accent, but this is letter A2 that looks visually similar to A1 with an accent but is semantically different". Of course, worries about code page size then resulted in the stupidity of Han unification, so Unicode is a bit broken.

2 comments

Dylan16807 703 days ago

Unless there's some nuance I'm missing, I think you're reading too much into the word "accent".

Especially because the codepoint is actually called "Combining Ogonek".

And for anyone writing in Cyrillic, it's actually more accurate to use the combining form, even as its own letter, because the only precomposed form technically uses a Latin A.

But my main point is that I do not think there is supposed to be any semantic difference in Unicode based on whether you use precomposed or decomposed code points.
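As a rough illustration of the points above, here is a small sketch using Python's standard `unicodedata` module. The variable names are only for illustration. It checks the code point's official name, shows that the precomposed letter decomposes to a Latin A plus the combining mark, and shows that a Cyrillic А followed by the same mark stays as two code points under NFC.

```python
import unicodedata

precomposed = "\u0104"         # Ą, LATIN CAPITAL LETTER A WITH OGONEK
cyrillic = "\u0410\u0328"      # Cyrillic А followed by the combining mark

# Official name of the combining code point U+0328.
ogonek_name = unicodedata.name("\u0328")

# Canonical decomposition of the precomposed letter, as hex code points.
decomposition = unicodedata.decomposition(precomposed)

# NFC composes where a precomposed form exists; for the Cyrillic base
# letter the sequence is left as two code points.
cyrillic_nfc = unicodedata.normalize("NFC", cyrillic)

print(ogonek_name)
print(decomposition)
print(len(cyrillic_nfc))
```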


eviks 702 days ago

But it is precisely "a with some accent", you just have two ways to encode it for


p_l 702 days ago

"Ą" is a separate letter in the Polish alphabet, not an accented variant of "A".

There are writing systems where combining accents are used to represent just a variation on a letter. Use of combining characters for "Ą" (and "Ć" and "Ł" and many other so-called "Polish letters") is, at best, a historical artefact of trying to write them in deficient encodings.


eviks 702 days ago

It doesn't matter that it's a separate letter in an alphabet. You're denying the obvious: it IS an accented (or ogonek'ed) variant of A, and you can encode it in Unicode in two ways: one code point for the precomposed variant, or two code points that compose the variant.

There is no semantic difference, just an encoding one. The end result looks the same and means the same thing (well, to a point: it still depends on the context, such as which language you mean, but within the same context it's the same thing, and there are even Unicode rules to treat it the same, for example in search).

And precomposed is just the same historical deficiency: you could just as well have designed a more compact encoding with no precomposed letters, only combinations.
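To make the search point concrete, here is a minimal sketch in Python, assuming a hypothetical helper `contains` that normalizes both strings before comparing. A raw substring test misses the match when one side uses the combining form; the normalized test finds it.

```python
import unicodedata

def contains(haystack: str, needle: str) -> bool:
    """Substring test that ignores precomposed vs. decomposed encoding."""
    def nfc(s: str) -> str:
        return unicodedata.normalize("NFC", s)
    return nfc(needle) in nfc(haystack)

# "MĄDRY" written with Latin A followed by the combining mark U+0328.
text = "MA" + "\u0328" + "DRY"
# "ĄD" written with the precomposed letter U+0104.
query = "\u0104" + "D"

raw_match = query in text                 # compares code points as-is
normalized_match = contains(text, query)  # compares after NFC

print(raw_match, normalized_match)
```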


pests 702 days ago

This is correct, and you can look into Unicode Normalization Form C (NFC) to find the conversion and equivalence rules.
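For anyone who wants to try it, a short sketch with Python's standard `unicodedata` module (the helper name `canonically_equal` is made up for this example): the two encodings differ as raw strings, NFC converts the decomposed form to the precomposed one, and NFD goes the other way.

```python
import unicodedata

def canonically_equal(a: str, b: str) -> bool:
    """True if both strings are identical after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

precomposed = "\u0104"        # single code point: Ą
decomposed = "A" + "\u0328"   # Latin A followed by the combining mark

# Different code point sequences, so a plain comparison says "not equal".
print(precomposed == decomposed)

# Equal once both sides are normalized.
print(canonically_equal(precomposed, decomposed))

# NFC composes, NFD decomposes.
to_nfc = unicodedata.normalize("NFC", decomposed)
to_nfd = unicodedata.normalize("NFD", precomposed)
print(len(to_nfc), len(to_nfd))
```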
