HN Mirror


by Charlotte_Buff 1384 days ago

Reversing a string is a useless operation in the real world. Its only application is padding out interview questions. “How to reverse a string” is also an incredibly vague question. What do you actually want me to do? Reverse code points, or code units, or grapheme clusters, or make it look like it’s written backwards? It doesn’t even make sense as a concept in most of the world’s writing systems.

It’s like giving me a list of numbers and asking me to “combine” them. What does that mean? Do I sum them up, or concatenate them, or something else entirely? A lot of string reversal solutions are “incorrect” because there isn’t even a correct question in the first place.
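To make the ambiguity concrete, here is a minimal Python sketch of three different "reversals" of the same text. The function names are made up for illustration, and `approx_clusters` is only a rough stand-in for real grapheme segmentation (UAX #29): it merely glues combining marks onto the preceding character and knows nothing about emoji sequences, Hangul jamo, and so on.

```python
import unicodedata

def reverse_code_points(s: str) -> str:
    # Python strings are sequences of code points, so slicing reverses those.
    return s[::-1]

def reverse_utf16_code_units(s: str) -> bytes:
    # Reverse the 16-bit code units of the UTF-16 (little-endian) encoding.
    data = s.encode("utf-16-le")
    units = [data[i:i + 2] for i in range(0, len(data), 2)]
    return b"".join(reversed(units))

def approx_clusters(s: str) -> list[str]:
    # ASSUMPTION: simplified clustering, not full UAX #29 segmentation.
    clusters: list[str] = []
    for ch in s:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch  # attach the mark to its base character
        else:
            clusters.append(ch)
    return clusters

def reverse_clusters(s: str) -> str:
    return "".join(reversed(approx_clusters(s)))

word = "noe\u0308l"  # "noël" written as e + COMBINING DIAERESIS

# Code point reversal moves the diaeresis onto the "l".
print(ascii(reverse_code_points(word)))
# Cluster reversal keeps the diaeresis on the "e".
print(ascii(reverse_clusters(word)))
# Code unit reversal of a non-BMP character swaps its surrogate pair,
# which is no longer well-formed UTF-16.
print(reverse_utf16_code_units("\U0001F600"))
```

All three are defensible readings of "reverse a string", and they give three different answers.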

Even with an infinitely large code space, doing away with combining marks and encoding everything as precomposed would be impossible because you cannot have a definitive list of every single combination of letters and diacritics that may mean something to someone. If Unicode had been the first digital character set ever created, it would not contain a single precomposed code point because they are utterly impractical. As such, normalisation – or at least the canonical reordering part of it – is always going to be a necessity.
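As a small illustration of the normalisation point, the sketch below uses Python's standard `unicodedata` module. The sample strings are my own choice: a dot above (U+0307) and a dot below (U+0323) on the same base letter can be typed in either order, and canonical reordering is what makes the two spellings compare equal. The last line assumes there is no precomposed "q with acute", which I believe is the case.

```python
import unicodedata

# The same visible text, with the two combining marks typed in either order.
dot_above_first = "a\u0307\u0323"
dot_below_first = "a\u0323\u0307"

# As raw code point sequences they differ...
print(dot_above_first == dot_below_first)

# ...but canonical reordering (part of NFD and NFC) sorts the marks by
# combining class, so the normalised forms are identical.
nfd_a = unicodedata.normalize("NFD", dot_above_first)
nfd_b = unicodedata.normalize("NFD", dot_below_first)
print(nfd_a == nfd_b)

# Precomposed and decomposed spellings of "é" also meet under normalisation.
print(unicodedata.normalize("NFC", "e\u0301") == "\u00e9")
print(unicodedata.normalize("NFD", "\u00e9") == "e\u0301")

# A combination without a precomposed code point stays as two code points.
print(len(unicodedata.normalize("NFC", "q\u0301")))
```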


indil 1384 days ago

>Reversing a string is a useless operation in the real world

I'm not sure why you focused on this one example, which was just meant to indicate the nature of the issue, not cite a broad concrete problem. There are plenty of situations where you'd want to operate on graphemes, not code points, like deleting the previous grapheme in a text editor. It would certainly help programmers write correct code if the two were the same.
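For the text editor case, a rough Python sketch of "delete the previous grapheme" might look like this. `delete_previous_grapheme` is a hypothetical helper, and it only handles base characters followed by combining marks; a real editor would need full grapheme cluster segmentation (UAX #29) to cope with emoji sequences, Hangul jamo, and the like.

```python
import unicodedata

def delete_previous_grapheme(text: str, cursor: int) -> tuple[str, int]:
    """Delete the cluster that ends at `cursor`; return (new_text, new_cursor).

    ASSUMPTION: a cluster is one base character plus any combining marks
    that follow it. This is a simplification of real segmentation rules.
    """
    start = cursor
    # Step back over any combining marks directly before the cursor.
    while start > 0 and unicodedata.combining(text[start - 1]):
        start -= 1
    # Then step back over the base character itself, if there is one.
    if start > 0:
        start -= 1
    return text[:start] + text[cursor:], start

text = "noe\u0308l"  # "noël" with a decomposed ë
# Backspacing with the cursor just after the ë removes both code points.
print(delete_previous_grapheme(text, 4))
```

Deleting a single code point instead would strip only the diaeresis and leave a plain "e" behind, which is sometimes what an editor wants and sometimes not.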

>doing away with combining marks and encoding everything as precomposed would be impossible because you cannot have a definitive list of every single combination of letters and diacritics that may mean something to someone

It seems to me it would be trivial to enumerate these combinations, and assign code points to them. For example, the Germanic umlaut is only used with vowels, so that's at most 5 code points.


acdha 1384 days ago

> It seems to me it would be trivial to enumerate these combinations, and assign code points to them. For example, the Germanic umlaut is only used with vowels, so that's at most 5 code points.

Well, 10 code points because vowels can be capitalized, and 12 once you include ÿ and Ÿ, which are used in other languages.

That's one of the easiest cases. Now you need to go through _every_ other language which has _ever_ been used in human history and repeat that process for every combining character. Note also that in some languages it's valid to keep stacking a fair number of combining modifiers so you'd need to cover every permutation allowed in each of them, and spend a lot of time working with linguists and classicists to make sure you weren't removing obscure combinations which are actually needed.

At the end of years of work, you'd have an encoding which is easier for C programmers to think about but means all of your documents require substantially more storage than they used to.


indil 1384 days ago

>Now you need to go through _every_ other language which has _ever_ been used in human history and repeat that process for every combining character. Note also that in some languages it's valid to keep stacking a fair number of combining modifiers so you'd need to cover every permutation allowed in each of them, and spend a lot of time working with linguists and classicists to make sure you weren't removing obscure combinations which are actually needed.

Perhaps this is just my ignorance talking, but it can't be that many permutations, can it? Somebody linked to https://en.wikipedia.org/wiki/Zalgo_text, which I doubt anyone would seriously want to enable. There's, what, maybe 3-4 marks typically added to chars in the most complex of cases, mostly for vowels, like Vietnamese. With 4 billion code points to work with, that seems doable. We could just throw in all permutations, regardless of past utility, to accommodate future expansions of acceptable marks. Chinese has, what, 10K chars? It doesn't seem like a big deal for Latin-based chars to have a similar set size when accounting for all mark variations.

>but means all of your documents require substantially more storage than they used to.

Good point! But that comes down to a trade-off analysis between design and space. High 32-bit code point values are meant to be used too, and not shied away from.


acdha 1384 days ago

> Chinese has, what, 10K chars? It doesn't seem like a big deal for Latin-based chars to have a similar set size when accounting for all mark variations.

I believe it's over a hundred thousand (don't forget scholars need to work with classical and/or obscure characters which aren't in common usage), and, while it isn't common, new ones are still being added. Han unification is a good cautionary example to consider anytime you think something related to Unicode is easy: https://en.wikipedia.org/wiki/Han_unification

Now, there are on the order of 150K characters in Unicode, so there is definitely a lot of room even for Chinese. I'm not so sure about the combinations, though: some languages use combining marks extensively (e.g. Navajo), and then there are things like emoji skin tone modifiers (multiply everything with skin by 5 variants) and zero-width joiners to handle things like gender. You get a lot of permutations if you try to precompose all of those into individual code points.
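To show what those sequences look like under the hood, here is a short Python sketch that lists the code points in two emoji written as escapes. The `describe` helper is just for illustration, and the character names come from whatever Unicode version the local Python build ships with.

```python
import unicodedata

# Thumbs up + a skin tone modifier: two code points, one visible emoji.
thumbs_up_medium = "\U0001F44D\U0001F3FD"
# Woman + ZWJ + woman + ZWJ + girl: five code points, one family emoji.
family = "\U0001F469\u200D\U0001F469\u200D\U0001F467"

def describe(s: str) -> list[str]:
    # Hypothetical helper: one "U+XXXX NAME" entry per code point.
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in s]

for sample in (thumbs_up_medium, family):
    print(len(sample), "code points")
    for line in describe(sample):
        print("   ", line)
```

A fully precomposed encoding would need a separate code point for every such base, modifier, and joiner combination.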

This is already sounding like a ton of work, even before you get to the question of adoption. Then you have to remember that the Unicode Consortium explicitly says that diacritic marks aren't specific to a particular known language's usage, so you either have to have every permutation or prove that nobody uses a particular one.

https://unicode.org/faq/char_combmark.html#10

The big question here is what the benefit would be, and it's hard to come up with one other than that everyone could treat strings as arrays of 32-bit integers. While nice, that doesn't seem compelling enough to take on a task of that order of magnitude.


Charlotte_Buff 1382 days ago

> It seems to me it would be trivial to enumerate these combinations, and assign code points to them.

Far from it. Even if you limit yourself to just Latin, the number of valid (whatever “valid” even means) combinations is already unmanageably gargantuan. Just look at phonetic notation as one example of many. The basic IPA alone uses over 100 letters for consonants and vowels, plus dozens of different diacritics, many of which need to be present concurrently on the same base letter. Make the jump to extended IPA or any number of other, more specialised transcription systems – and there are plenty – and you’ll never see the end of it.

Sure, it may be technically possible to create an exhaustive list of letter-and-diacritic combinations, just like you can technically create an exhaustive list of every single human on Earth, but good luck getting there. And good luck making sure you didn’t miss anything in the process.

Of course, you don’t need to limit yourself to Latin, because Unicode has 160 other writing systems to offer.

Writing systems like Tibetan and Newa where consonants can be stacked vertically to arbitrary heights and then have vowel signs and other marks attached as a bonus as well.

Or Hangul, which would occupy no fewer than 1,638,750 code points if all possible syllable blocks were encoded atomically, and that doesn’t even account for the archaic tone marks, or those novel letters that North Korea once tried to establish that aren’t even in Unicode yet.

Or Sutton SignWriting whose system of combining marks and modifiers is so complex that I’m not even gonna explain it here.
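The Hangul figure above can be checked with a little arithmetic. The sketch below first reproduces the standard composition formula for the 11,172 modern syllables that Unicode does encode atomically, then multiplies out one factorisation that matches the 1,638,750 figure. The jamo counts of 125 leading consonants, 95 vowels, and 138 trailing options (each including a filler or "none" slot) are my own reading of the Hangul Jamo blocks, not something stated in the comment.

```python
import unicodedata

# Constants from the Unicode Hangul syllable composition algorithm.
S_BASE = 0xAC00
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28  # T_COUNT includes "no final"

def compose_syllable(l_index: int, v_index: int, t_index: int = 0) -> str:
    # Map (leading, vowel, trailing) jamo indices to a precomposed syllable.
    return chr(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT + t_index)

modern_total = L_COUNT * V_COUNT * T_COUNT
print(modern_total)

# U+D55C decomposes into three conjoining jamo under NFD.
han = compose_syllable(18, 0, 4)
print(han, [f"U+{ord(ch):04X}" for ch in unicodedata.normalize("NFD", han)])

# ASSUMPTION: jamo counts including archaic letters and filler slots.
archaic_total = 125 * 95 * 138
print(archaic_total)
```

So the modern subset fits comfortably, while the full set of possible blocks is more than a hundred times larger.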

If you eschew combining characters then yes, you will create an encoding where every code point is at the same time a full grapheme cluster and that definitely has concrete advantages, but as a consequence you have now assigned to yourself the unenviable task of having to possess perfect, nigh-omniscient knowledge of every single thing that a person has ever written down in the entirety of human history. Because unless you possess that knowledge, you will leave out things that some people need to type on a computer under some circumstances.

Every time some scholar discovers a previously forgotten vowel sign in an old Devanagari manuscript, you need to encode not only that one new character, but every combination of that vowel sign with any of the (currently) 53 Devanagari consonants, plus Candrabindu, Anusvara, and Visarga at the very least, just in case these combinations pop up somewhere, because they’re all linguistically meaningful and well-defined.

It’s doable, in a sense, but why would you subject yourself to that if you can just make characters combine with each other instead?
