by WalterBright 101 days ago
Another dum dum Unicode idea is having multiple code points with identical glyphs.

Rule of thumb: two Unicode sequences that look identical when printed should consist of the same code points.

5 comments

If anything, Unicode should have had more disambiguated characters. Han unification was a mistake, and dedicated code points for the lower case dotted Turkish i and the upper case dotless Turkish I should exist, so that toUpper and toLower wouldn't need to know (or guess at) a locale to work correctly.
Characters should not have invisible semantics.
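For anyone who hasn't run into the Turkish i problem mentioned above, here is a minimal Python sketch of it. Python's built-in str.upper() and str.lower() take no locale, so they apply the default mappings. The helpers turkish_upper and turkish_lower are hypothetical names made up for this illustration; they are not part of any library, and they only handle the i/I pairs.

```python
DOTLESS_SMALL_I = "\u0131"   # ı  LATIN SMALL LETTER DOTLESS I
DOTTED_CAPITAL_I = "\u0130"  # İ  LATIN CAPITAL LETTER I WITH DOT ABOVE


def turkish_upper(text: str) -> str:
    """Hypothetical helper: uppercase using the Turkish i/İ and ı/I pairs."""
    # Map the two Turkish-specific pairs first, then let the default
    # mapping handle every other character.
    return text.replace("i", DOTTED_CAPITAL_I).replace(DOTLESS_SMALL_I, "I").upper()


def turkish_lower(text: str) -> str:
    """Hypothetical helper: lowercase using the Turkish I/ı and İ/i pairs."""
    return text.replace("I", DOTLESS_SMALL_I).replace(DOTTED_CAPITAL_I, "i").lower()


if __name__ == "__main__":
    # Default mappings: correct for English, wrong for Turkish.
    print("istanbul".upper())
    print(turkish_upper("istanbul"))
    # The default lowercase of İ is 'i' plus a combining dot (two code points).
    print([f"U+{ord(c):04X}" for c in DOTTED_CAPITAL_I.lower()])
```

The point is that the same code point 'i' needs two different uppercase forms depending on the language of the text, which is information the string itself does not carry.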
So you think that the letters in the Greek and Cyrillic alphabets which are printed identically to the Latin A should not exist?

And, for example, Greek words containing this letter should be encoded with a mix of Latin and Greek characters?
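To make the question concrete, here is a small Python sketch using the standard unicodedata module. It shows that the three capital letters in question are separate code points, and that they stop looking alike as soon as you lowercase them. The function name describe is just an illustrative choice.

```python
import unicodedata

# Three capitals that most fonts draw with the same glyph.
LOOKALIKES = ["A", "\u0391", "\u0410"]  # Latin A, Greek Alpha, Cyrillic A


def describe(ch: str) -> str:
    """Return the code point and Unicode name of a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


if __name__ == "__main__":
    for ch in LOOKALIKES:
        lower = ch.lower()
        print(f"{describe(ch)}  ->  {describe(lower)}")
```

The lowercase of Greek Alpha is α, not a, so a Greek word stored with a Latin A would lowercase to the wrong letter.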

> So you think that the letters in the Greek and Cyrillic alphabets which are printed identically to the Latin A should not exist?

Yes. Unicode should not be about semantic meaning, it should be about the visual. Like text in a book.

> And, for example, Greek words containing this letter should be encoded with a mix of Latin and Greek characters?

Yup. Consider a printed book. How can you tell if a letter is a Greek letter or a Latin letter?

Those Unicode homonyms are a solution looking for a problem.

> Yes. Unicode should not be about semantic meaning, it should be about the visual. Like text in a book.

Do you think 1, l and I should be encoded as the same character, or does this logic only extend to characters pesky foreigners use?

They are visually distinct to the reader.
That is entirely dependent on the font.
Unicode is about semantics not appearance. If you don't need semantics then use something different.
> Unicode is about semantics not appearance.

And that's where it went off the rails into lala land. 'a' can have all kinds of distinct meanings. How are you going to make that work? It's hopeless.

It already works.

Tell me what the problem is and what your proposed solution would be.

Infer the meaning from the context.

    a) it's a bullet point
    b) a+b means a is a variable
    c) apple means a means the sound "aaaah"
    d) ape means a means the sound "aye"
    e) 0xa means a means "10"
    f) "a" on my test paper means I did well on it
    g) grade "a" means I bought the good bolts
    h) "achtung" means it's a German "a"
I didn't need 8 different Unicode characters. And so on.
>Yup. Consider a printed book. How can you tell if a letter is a Greek letter or a Latin letter?

I can absolutely tell Cyrillic k from the Latin к and Latin u from the Cyrillic и.

>should not be about semantic meaning,

It's always better to be able to preserve more information in a text and not less.

> I can absolutely tell Cyrillic k from the Latin к and Latin u from the Cyrillic и.

They look visually distinct to me. I don't get your point.

> It's always better to be able to preserve more information in a text and not less.

Text should not lose information by printing it and then OCR'ing it.

But these characters only look identical in some fonts. Are you saying that if you change font, some characters in a string should change appearance and others should not?

And what about the round-trip rule?

And ligatures? Aren't those a semantic distinction?

> But these characters only look identical in some fonts.

That's a problem with the fonts.

> And what about the round-trip rule?

Print Unicode on paper, then OCR it, and you'll get different Unicode. Oh, and normalization.

> ligatures

Generally an issue with rendering.

> semantic distinction

Unicode isn't about semantics (or shouldn't be). Consider 'a'. It's used for all kinds of meanings.

What about numbers? Would they be assigned to Arabic only? I guess someone will be offended by that.

While at it we could also unify I, | and l. It's too confusing sometimes.

> While at it we could also unify I, | and l. It's too confusing sometimes.

They render differently, so it's not a problem.

They only render differently in some fonts, on some displays.
totally not true :D
Look again at its rendering!
I don't think that would help much. There are also characters which are similar but not the same, and I don't think humans can spot the differences unless they are actively looking for them, which most of the time people are not. If only one of two similar glyphs appears in the text, nobody would likely notice; expectation bias will fuck you over.
I wonder how anybody got by with printed books.
As far as I know, glyphs are determined by the font and rendering engine. They're not in the Unicode standard.
Fraktur (font) and italic (rendering) are in the Unicode standard, although Hacker News will not render them. (I suspect that the Hacker News software filters out the nuttier Unicode stuff.)
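For reference, the Fraktur and italic letters live in the Mathematical Alphanumeric Symbols block. A short Python sketch with the standard unicodedata module shows two of them and what compatibility normalization (NFKC) does to them. The function name plain_form is a made-up name for this example.

```python
import unicodedata

FRAKTUR_CAPITAL_A = "\U0001D504"  # MATHEMATICAL FRAKTUR CAPITAL A
ITALIC_CAPITAL_A = "\U0001D434"   # MATHEMATICAL ITALIC CAPITAL A


def plain_form(ch: str) -> str:
    """Fold a styled letter to its compatibility equivalent via NFKC."""
    return unicodedata.normalize("NFKC", ch)


if __name__ == "__main__":
    for ch in (FRAKTUR_CAPITAL_A, ITALIC_CAPITAL_A):
        print(unicodedata.name(ch), "->", plain_form(ch))
```

Both fold to a plain Latin A under NFKC, so the standard itself treats the styling as a compatibility distinction rather than a different letter.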
One of the ground rules of Unicode is the round trip rule. You have to be able to translate to and from Unicode without loss of information.
They threw that out the window with normalization.
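A quick way to see what normalization does to a round trip, sketched in Python with the standard unicodedata module. The three sample characters are my own picks for illustration; the helper name survives is hypothetical.

```python
import unicodedata

ANGSTROM_SIGN = "\u212B"   # Å  ANGSTROM SIGN
A_WITH_RING = "\u00C5"     # Å  LATIN CAPITAL LETTER A WITH RING ABOVE
E_DECOMPOSED = "e\u0301"   # e + COMBINING ACUTE ACCENT
E_COMPOSED = "\u00E9"      # é  LATIN SMALL LETTER E WITH ACUTE
FI_LIGATURE = "\uFB01"     # ﬁ  LATIN SMALL LIGATURE FI


def survives(form: str, text: str) -> bool:
    """True if normalizing to `form` leaves the code points unchanged."""
    return unicodedata.normalize(form, text) == text


if __name__ == "__main__":
    # Canonical normalization already rewrites some code points.
    print("ANGSTROM SIGN survives NFC:", survives("NFC", ANGSTROM_SIGN))
    print("e + combining acute survives NFC:", survives("NFC", E_DECOMPOSED))
    # The fi ligature survives NFC but not compatibility normalization.
    print("fi ligature survives NFC:", survives("NFC", FI_LIGATURE))
    print("fi ligature survives NFKC:", survives("NFKC", FI_LIGATURE))
```

After NFC the ANGSTROM SIGN becomes LATIN CAPITAL LETTER A WITH RING ABOVE, so the original code point cannot be recovered from the normalized text.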