| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by rstuart4133 2389 days ago

I can possibly forgive them not coming up with UTF-8 first, and a few character specific mistakes. I can even forgive them for not coming up with UTF-8 at all, which is fact what happened because it was invented outside of the organisation and then common usage forced it onto them.

What I can not forgive is coming up with something that allows the same string have multiple representations, which happens in UCS-2 / UTF-16, and because pre-composed codepoints overlap pre-existing characters. That mistake makes "string1" == "string2" meaningless in the general case. It's already caused exploits.

I also can not forgive pre-composed codepoints for another reason: the make parsing strings error prone. This is because ("A" in "Åström") no longer works - it will return true in if Å is composite character.

1 comments

cryptonector 2389 days ago

You've not read the comments here explaining why that arose naturally from the way human scripts work, not from choices they made.

Let me explain again. We have these pesky diacritical marks. How shall they be represented? Well, if you go with combining codepoints, the moment you need to stack more than one diacritical on a base character then you have a normalization problem. Ok then! you say, let us go with precompositions only. But then you get other problems, such as that you can't decompose them so you can cleverly match 'á' when you search for 'a', or that you can't easily come up with new compositions the way one could with overstrike (in ASCII) or with combining codepoints in decomposed forms. Fine!, you say, I'll take that! But now consider Hangul, where the combinations are to form syllables that are themselves glyphs... the problems get much worse now. What might your answer be to that? That the rest of the world should abandon non-English scripts, that we should all romanize all non-English languages, or even better(?), abandon all non-English languages?

(And we haven't even gotten to case issues. You think there are no case issues _outside_ Unicode? That would be quite wrong. Just look at the Turkish dot-less i, or Spanish case folding for accented characters... For example, in French upcasing accented characters does not lose the accent, but in Spanish they do -or used to, or are allowed to be lost, and often are-, which means that accented characters don't round-trip in Spanish.)

No, I'm sorry, but the equivalence / normalization problem was unavoidable -- it did not happen because of "egos" on the Unicode Consortium, or because of "national egos", or politics, or anything.

The moment you accept that this is an unavoidable problem, life gets better. Now you can make sure that you have suitable Unicode implementations.

What? you think that normalization is expensive? But if you implement a form-insensitive string comparison, you'll see that the vast majority of the time you don't have to normalize at all, and for mostly-ASCII strings the performance of form-insensitivity can approach that of plain old C locale strcmp().

link

rstuart4133 2388 days ago

> But now consider Hangul, where the combinations are to form syllables that are themselves glyphs... the problems get much worse now. What might your answer be to that?

I don't have an answer, mostly because I know nothing about Hangul. Maybe decomposition is the right solution there. Frankly I don't care what Unicode does to solve the problems Hangul creates, and as Korea about 1% of the world's population I doubt many other people here care either.

I'm commenting about Latin languages. There is absolutely no doubt what is easiest for a programmer there: one code point per grapheme. We've tried considering 'A' and 'a' the same in both computer languages and file systems. It was a mess. No one does it any more.

> But then you get other problems, such as that you can't decompose them so you can cleverly match 'á' when you search for 'a'

It's not a problem. We know how to match 'A' and 'a', which are in every sense closer than 'á' and 'a' ('á' and 'a' can be different phonetically, 'A' and 'a' aren't). If matching both 'A' and 'a' isn't a major issue, why would 'á' and 'a' be an issue that Unicode must to solve for us?

In fact given it's history I'm sort of surprised Unicode didn't try and solve it by adding a composition to change case. shudder

> And we haven't even gotten to case issues.

The "case issues" should not have been Unicode's issue at all. Unicode should have done one thing, well. That one thing was ensure visually distinct string had one, and only one, unique encoding.

There is objective reason for wanting that. Typically programmers do not do much with strings. The two most common things they do is move them around, and compare them for equality but also sort them. They naturally don't read the Unicode standard. They just expect the binary representation of strings to faithfully follow what their eyes tell them should happen: if two strings look identical, their Unicode representation will be identical. It's not an unreasonable expectation. If it's true those basic operations of moving and comparing will be simple, and more importantly efficient on a computer.

The one other thing we have to do a lot less often, but nonetheless occupies a fair bit of our time is parsing a string. It occupies our time because it's fiddly and takes a lot of code, and is error prone. I still remember the days when languages string handling are a selection criteria. (It's still the reason I dislike Fortran.) I'm not talking about complex parsing here - it's usually something like spilt it into words or file system path components, or going looking for a particular token. It invariably means moving along the string one grapheme at time, sniffing for what you want and extracting it. (Again this quite possibly is only meaningful for Latin based languages - but that's OK because the things we are after are invariably Latin characters in configuration files, file names and the like. The rest can be treated as uninteresting blobs.) And now Unicode's composition has retrospectively simple operation far harder to do correctly.

All other text handling programmers do is now delegated to libraries of some sort. You mention one: nobody does case conversion themselves. They call strtolower() for ASCII or a Unicode equivalent. Hell, as soon as you leave Latin we even printing it correctly requires years of expertise to master. The problems that crop up may as you say may be unavoidable, but that's OK because they are so uncommon I'm willing to wear the speed penalty to use somebody else's code to do it.

> it did not happen because of "egos" on the Unicode Consortium, or because of "national egos", or politics, or anything.

Did someone say that? Anyway, it's pretty obvious why it happened. When a person invents new hammer, the second thing they do is going looking for all the other problems it might solve. A little tweak here and it would do that job too! I saw an apprentice sharpen the handle of his Estwing hammer once. It did make it a useful wire cutter in a pinch, but no prizes for guessing what happen when he just using it as a hammer.

Unicode acquired it's warts by attempting to solve everybody's problems. Instead of making it more and more complex, they should have ruthlessly optimised it to make it work near flawlessly it's most common user: a programmer who couldn't give a shit about internationalisation, and wasted the bare minimum of his time on stackoverflow before using it.

The tragedy is it didn't do that.

link

cryptonector 2388 days ago

> > it did not happen because of "egos" on the Unicode Consortium, or because of "national egos", or politics, or anything.

> Did someone say that?

Yes, u/kazinator did.

> Anyway, it's pretty obvious why it happened. When a person invents new hammer, the second thing they do is going looking for all the other problems it might solve.

That's not why decomposition happened. It happened because a) decomposition already existed outside Unicode, b) it's useful. Ditto pre-composition.

> Unicode acquired it's warts by attempting to solve everybody's problems.

Unicode acquired its warts by attempting to be an incremental upgrade to other codesets. And also by attempting to support disparate scripts with disparate needs. The latter more than the former.

> Instead of making it more and more complex, they should have ruthlessly optimised it to make it work near flawlessly it's most common user: a programmer who couldn't give a shit about internationalisation, ...

They did try to ruthlessly optimize it: by pursuing CJK unification. That failed due to external politics.

As to the idea that programmers who want nothing to do with I18N are the most common user or Unicode, that's rather insulting to the real users: the end users. All of this is to make life easier on end users: so they need not risk errors due to not their (or their SW) not being able to keep track of what codeset/encoding some document is written in, so they can mix scripts in documents, and so on.

Unicode is not there to make your life harder. It's there to make end users' lives easier. And it's succeeded wildly at that.

> > And we haven't even gotten to case issues.

> The "case issues" should not have been Unicode's issue at all. Unicode should have done one thing, well. That one thing was ensure visually distinct string had one, and only one, unique encoding.

You really should educate yourself on I18N.

link

rstuart4133 2388 days ago

> As to the idea that programmers who want nothing to do with I18N are the most common user or Unicode, that's rather insulting to the real users: the end users.

Oh for Pete's sake. Unicode / ASCII / ISO 8859-1 are encoding computers and thus programmers use to represent text. Users don't read Unicode, they read text. They never, ever have to deal with Unicode and most wouldn't know what it was if it leapt up and hit them in the faxe, so if Unicode justified adding features to accommodate these non-existent users, I guess that explains how we got into this mess.

link

cryptonector 2388 days ago

They read text in many scripts (because many writers use more than one script, and many readers can read more than one script). Without Unicode you can usually use TWO scripts: ASCII English + one other (e.g., ISO8859-*, SHIFT_JIS, ...). There are NO OTHER CODESETS than Unicode that can support THREE or more scripts, or any combination of TWO where one isn't ASCII English. For example, Indian subcontinent users are very likely to speak multiple languages and use at least two scripts other than ASCII English. Besides normal people all over the world who have to deal with multiple scripts, there's also: scolars, diplomats, support staff at many multi-nationals, and many others who also need to deal with multiple scripts.

Whether you like it or not, Unicode exists to make USERS' lives better. Programmers?? Pfft. We can deal with the complexity of all of that. User needs, on the other hand, simply cannot be met reasonably with any alternatives to Unicode.

link

rstuart4133 2385 days ago

> Whether you like it or not, Unicode exists to make USERS' lives better.

They've been using multiple scripts long before computers. When computers came along those users quite reasonably demanded they be able to write the same scripts. This created a problem for the programmers. The obvious solution is a universal set of code points - and ISO 10646 was born. It was not rocket science. But if it had not come along some other hack / kludge would have been used because the market is too large to be abandoned by the computer companies. They would have put us programmers in a special kind of hell, but I can guarantee the users would not have known about that, let alone cared.

Oddly the encoding schemes proposed by ISO 10646 universally sucked. Unicode walked into that vacuum with their first cock up - 16 bits was enough for anybody. It was not just dumb because it was wrong - it was dumb because they didn't propose unique encoding. They gave us BOM markers instead. Double fail. They didn't win because the one thing they added, their UCS-2 encoding, was any better than what came before. They somehow managed to turn it into a Europe vs USA popularity contest. Nicely played.

Then Unicode and 10646 became the same thing. They jointly continued on in the same manner as before, inventing new encodings to paper over the UCS-2 mistake. Those new encodings all universally sucked. The encoding we programmers use today, UTF-8, was invented by, surprise, surprise, a programmer who was outside of Unicode / 10646 group think. It was popularised at a programmers conference, USENIX, and from there on it was obvious was going to be used regardless of what Unicode / 10646 thought of it, so they accepted it.

If Perl6 is any indication, the programmers are getting the shits with the current mess. Perl6 has explicitly added functions that treat text as a stream of grapheme's rather than Unicode code points. The two are the same of course except when your god damned compositions rear their ugly head - all they do is eliminate that mess. Maybe you should take note. Some programmers have already started using something other than Unicode because it makes their jobs easier. If it catches on you are going to find out real quickly just how much the users care about the encoding scheme computers use for text.

None of this is to trivialise the task of assigning code points to grapheme's. It's huge task. But for pete's sake don't over inflate your ego's by claiming you are doing some great service for mankind. The only thing on the planet that uses the numbers Unicode assigns to characters as is computers. Computers are programmed by one profession - programmers. Your entire output is consumed by that one profession. Yet for all the world what you've written here seems to say Unicode has some higher purpose, and you are optimising for that, whatever it may be. For gods same come down to earth.

link