HN Mirror


by acdha 1379 days ago

> It seems to me it would be trivial to enumerate these combinations, and assign code points to them. For example, the Germanic umlaut is only used with vowels, so that's at most 5 code points.

Well, 10 code points because vowels can be capitalized and 12 because ÿ is used in other languages.
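For what it's worth, that count can be checked rather than tallied by hand. A small Python sketch, assuming the standard `unicodedata` module and treating "has a precomposed form" as "NFC collapses base + U+0308 into a single code point" (the helper name is made up for illustration):

```python
import unicodedata

COMBINING_DIAERESIS = "\u0308"

def precomposed_umlauts(bases="aeiouyAEIOUY"):
    """Map each base letter to its single-code-point form with a diaeresis,
    skipping any base that NFC cannot compose into one code point."""
    result = {}
    for base in bases:
        composed = unicodedata.normalize("NFC", base + COMBINING_DIAERESIS)
        if len(composed) == 1:
            result[base] = composed
    return result

umlauts = precomposed_umlauts()
print(len(umlauts), "".join(umlauts.values()))
```

All twelve of those already exist as precomposed characters, which is why this case looks deceptively easy.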

That's one of the easiest cases. Now you need to go through _every_ other language which has _ever_ been used in human history and repeat that process for every combining character. Note also that in some languages it's valid to keep stacking a fair number of combining modifiers so you'd need to cover every permutation allowed in each of them, and spend a lot of time working with linguists and classicists to make sure you weren't removing obscure combinations which are actually needed.
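To illustrate the stacking problem, here is a hedged sketch using Python's standard `unicodedata` module. The two sample letters are my own picks (Vietnamese ệ and Navajo ǫ́), and `nfc_length` is just an illustrative helper:

```python
import unicodedata

def nfc_length(text):
    """Number of code points left after NFC normalization."""
    return len(unicodedata.normalize("NFC", text))

# Vietnamese ệ: e + dot below (U+0323) + circumflex (U+0302).
vietnamese = "e\u0323\u0302"
# Navajo ǫ́: o + ogonek (U+0328) + acute (U+0301).
navajo = "o\u0328\u0301"

viet_len = nfc_length(vietnamese)
navajo_len = nfc_length(navajo)
print(viet_len, navajo_len)
```

The Vietnamese letter collapses to a single precomposed code point, but the Navajo one does not: there is a precomposed ǫ, yet the acute still has to ride along as a combining mark. Every gap like that is one you'd have to find and fill.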

At the end of years of work, you'd have an encoding which is easier for C programmers to think about but means all of your documents require substantially more storage than they used to.


indil 1379 days ago

> Now you need to go through _every_ other language which has _ever_ been used in human history and repeat that process for every combining character. Note also that in some languages it's valid to keep stacking a fair number of combining modifiers so you'd need to cover every permutation allowed in each of them, and spend a lot of time working with linguists and classicists to make sure you weren't removing obscure combinations which are actually needed.

Perhaps this is just my ignorance talking, but it can't be that many permutations, can it? Somebody linked to https://en.wikipedia.org/wiki/Zalgo_text, which I doubt anyone would seriously want to enable. There's, what, maybe 3-4 marks typically added to chars in the most complex of cases, mostly for vowels, like Vietnamese. With 4 billion code points to work with, that seems doable. We could just throw in all permutations, regardless of past utility, to accommodate future expansions of acceptable marks. Chinese has, what, 10K chars? It doesn't seem like a big deal for Latin-based chars to have a similar set size when accounting for all mark variations.

> but means all of your documents require substantially more storage than they used to.

Good point! But that comes down to a trade-off analysis between design and space. High 32-bit code point values are meant to be used too, and not shied away from.

acdha 1379 days ago

> Chinese has, what, 10K chars? It doesn't seem like a big deal for Latin-based chars to have a similar set size when accounting for all mark variations.

I believe it's over a hundred thousand (don't forget scholars need to work with classical and/or obscure characters which aren't in common usage), and while it isn't common, new ones are still being added. Han unification is a good cautionary example to consider any time you think something related to Unicode is easy: https://en.wikipedia.org/wiki/Han_unification

Now, there are on the order of 150K characters in Unicode, so there is definitely a lot of room even for Chinese. I'm not so sure about the combinations, though. There are languages which use combining marks extensively (e.g. Navajo), plus things like emoji skin tone modifiers (multiply everything with skin by 5 variants) and zero-width joiners to handle things like gender, so you'd get a lot of permutations if you tried to precompose all of those into individual code points.
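A rough sketch of how those emoji sequences multiply, in plain Python with no libraries. The base emoji (thumbs up, woman, laptop) are arbitrary examples I chose, and the helper name is hypothetical:

```python
# The five skin tone modifiers, U+1F3FB through U+1F3FF.
SKIN_TONE_MODIFIERS = [chr(cp) for cp in range(0x1F3FB, 0x1F400)]
ZWJ = "\u200d"  # zero-width joiner

def skin_tone_variants(base):
    """Return base followed by each of the five skin tone modifiers."""
    return [base + modifier for modifier in SKIN_TONE_MODIFIERS]

# Thumbs up (U+1F44D) in five tones: two code points each.
thumbs = skin_tone_variants("\U0001F44D")

# Woman (U+1F469) + ZWJ + laptop (U+1F4BB): three code points, one glyph.
technologist = "\U0001F469" + ZWJ + "\U0001F4BB"

# Add a skin tone to the ZWJ sequence: four code points, still one glyph.
toned_technologists = [
    "\U0001F469" + modifier + ZWJ + "\U0001F4BB"
    for modifier in SKIN_TONE_MODIFIERS
]

print(len(thumbs), len(technologist), len(toned_technologists[0]))
```

Each of those sequences renders as a single glyph on platforms that support it, so a fully precomposed scheme would need a dedicated code point for every one of them.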

This is already sounding like a ton of work, even before you get to the question of getting adoption. You also have to remember that the Unicode Consortium explicitly says that diacritic marks aren't specific to a particular known language's usage, so you either have to have every permutation or prove that nobody uses a particular one.

https://unicode.org/faq/char_combmark.html#10
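Here's a back-of-the-envelope for "every permutation", as a Python sketch. The inputs are all assumptions of mine: only the 52 unaccented Latin letters as bases, only marks from the Combining Diacritical Marks block (U+0300 to U+036F), at most 4 marks per letter, and mark order ignored. Treat the result as a rough illustration, not a real estimate:

```python
import math
import sys
import unicodedata

# Unicode's code space ends at U+10FFFF, i.e. 1,114,112 code points in total.
CODE_SPACE = sys.maxunicode + 1

# Nonspacing marks in the Combining Diacritical Marks block only (assumption).
marks = [
    cp for cp in range(0x0300, 0x0370)
    if unicodedata.category(chr(cp)) == "Mn"
]

BASE_LETTERS = 52  # a-z and A-Z, an arbitrary simplification
MAX_STACK = 4      # at most four marks per base letter (assumption)

def mark_sets(n_marks, max_stack):
    """Count unordered selections of 0..max_stack distinct marks."""
    return sum(math.comb(n_marks, k) for k in range(max_stack + 1))

total = BASE_LETTERS * mark_sets(len(marks), MAX_STACK)
exceeds_code_space = total > CODE_SPACE
print(total, CODE_SPACE, exceeds_code_space)
```

Even under those narrow assumptions the count lands far above the roughly 1.1 million code points that exist, so "just throw in all permutations" isn't on the table. You're back to proving which combinations nobody needs.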

The big question here is what the benefit would be, and it's hard to come up with one other than that everyone could treat strings as arrays of 32-bit integers. While nice, that doesn't seem compelling enough to take on a task of that order of magnitude.
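To put a number on the 32-bit array idea, here is a small sketch using only Python's built-in `str.encode`. The sample string is arbitrary, and I'm using `utf-32-le` so that no byte order mark gets counted:

```python
def encoded_sizes(text):
    """Compare code point count with UTF-8 and UTF-32 byte sizes."""
    return {
        "code_points": len(text),
        "utf8_bytes": len(text.encode("utf-8")),
        "utf32_bytes": len(text.encode("utf-32-le")),  # no BOM
    }

# "naïve café" written with precomposed ï (U+00EF) and é (U+00E9).
sizes = encoded_sizes("na\u00efve caf\u00e9")
print(sizes)
```

For mostly-Latin text like this, the fixed-width form is more than three times the size of UTF-8, which is the price of getting simple integer indexing.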