by WalterBright 97 days ago

Unicode should be for visible characters. Invisible characters are an abomination. So are ways to hide text by using Unicode so-called "characters" to cause the cursor to go backwards.

Things that vanish on a printout should not be in Unicode.

Remove them from Unicode.

10 comments

pvillano 97 days ago

Unicode is "designed to support the use of text in all of the world's writing systems that can be digitized"

Unicode needs tab, space, form feed, and carriage return.

Unicode needs U+200E LEFT-TO-RIGHT MARK and U+200F RIGHT-TO-LEFT MARK to switch between left-to-right and right-to-left languages.

Unicode needs U+115F HANGUL CHOSEONG FILLER and U+1160 HANGUL JUNGSEONG FILLER to typeset Korean.

Unicode needs U+200C ZERO WIDTH NON-JOINER to encode that two characters should not be connected by a ligature.

Unicode needs U+200B ZERO WIDTH SPACE to indicate a word break opportunity without actually inserting a visible space.

Unicode needs MONGOLIAN FREE VARIATION SELECTORs to encode the traditional Mongolian alphabet.
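For anyone who wants to check what these code points actually are, the Unicode character database can be queried directly, here through Python's standard `unicodedata` module. This is only a minimal sketch: the `describe` helper is a made-up name, and U+180B is picked as one representative of the Mongolian free variation selectors.

```python
import unicodedata

# Code points named in the comment above. U+180B is one of the Mongolian
# free variation selectors, chosen here as a representative (an assumption).
CODE_POINTS = [0x200E, 0x200F, 0x115F, 0x1160, 0x200C, 0x200B, 0x180B]

def describe(cp: int) -> tuple[str, str, str]:
    """Return (U+XXXX label, general category, official name) for a code point."""
    ch = chr(cp)
    return (f"U+{cp:04X}", unicodedata.category(ch), unicodedata.name(ch))

if __name__ == "__main__":
    for cp in CODE_POINTS:
        label, category, name = describe(cp)
        print(f"{label}  {category}  {name}")
```

Most of these report the general category `Cf` (format), while the two Hangul fillers are classified as letters.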

luke-stanley 97 days ago

So we need a new standard problem due to the complexity of the last standard? Isn't Unicode supposed to be a superset of ASCII, which already has control characters like space, CR, and newline? xD

WalterBright 97 days ago

The only ones people use any more are newline and space. A tab key is fine in your editor, but it's been more or less abandoned as a character. I haven't used a form feed character since the 1970s.

tetha 97 days ago

That ship has sailed, but I consider Unicode a good thing, yet I consider it problematic to support Unicode in every domain.

I should be able to use Ü as a cursed smiley in text, and many more writing systems supported by Unicode support even more funny things. That's a good thing.

On the other hand, if technical and display file names (to GUI users) were separate, my need for crazy characters in file names, code bases and such would be very limited. Lower ASCII for actual file names consumed by technical people is sufficient for me.

WalterBright 97 days ago

> That ship has sailed

Sure, but more crazy stuff gets added all the time.

WalterBright 97 days ago

Another dum dum Unicode idea is having multiple code points with identical glyphs.

Rule of thumb: two Unicode sequences that look identical when printed should consist of the same code points.

estebank 97 days ago

If anything, Unicode should have had more disambiguated characters. Han unification was a mistake, and lower case dotted Turkish i and upper case dotless Turkish I should exist so that toUpper and toLower didn't need to know/guess at a locale to work correctly.
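The Turkish case problem is easy to reproduce. The sketch below uses Python's built-in `str.upper()` and `str.lower()`, which apply Unicode's locale-independent default case mappings; the helper name `survives_upper_lower` is invented for illustration.

```python
# Python's str.upper()/str.lower() use locale-independent default mappings,
# so they cannot apply the Turkish rules (i <-> İ, ı <-> I).
DOTTED_CAPITAL_I = "\u0130"      # İ LATIN CAPITAL LETTER I WITH DOT ABOVE
DOTLESS_SMALL_I = "\u0131"       # ı LATIN SMALL LETTER DOTLESS I
COMBINING_DOT_ABOVE = "\u0307"

def survives_upper_lower(ch: str) -> bool:
    """True if uppercasing and then lowercasing gives back the same string."""
    return ch.upper().lower() == ch

if __name__ == "__main__":
    # Default mapping: "i" uppercases to plain "I", not to Turkish "İ".
    print("i".upper() == "I")
    # Dotless ı also uppercases to plain "I", so the distinction is lost.
    print(DOTLESS_SMALL_I.upper() == "I")
    # İ lowercases to "i" plus a combining dot above (two code points).
    print([f"U+{ord(c):04X}" for c in DOTTED_CAPITAL_I.lower()])
    print(survives_upper_lower("a"), survives_upper_lower(DOTLESS_SMALL_I))
```

Because `I` is shared between the two alphabets, a case-conversion routine has to be told the language out of band to get Turkish text right.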

WalterBright 97 days ago

Characters should not have invisible semantics.

nswango 97 days ago

So you think that the letters in the Greek and Cyrillic alphabets which are printed identically to the Latin A should not exist?

And, for example, Greek words containing this letter should be encoded with a mix of Latin and Greek characters?
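To make the three look-alike capitals concrete, the sketch below lists them and their lowercase forms using Python's standard `unicodedata` module. The helper name `lowercase_names` is hypothetical; the point is only to show what the character database says, not to settle the argument.

```python
import unicodedata

# Latin A, Greek Alpha, Cyrillic A: often drawn with the same glyph.
LOOKALIKES = ["\u0041", "\u0391", "\u0410"]

def lowercase_names(chars):
    """Map each character's official name to the name of its lowercase form."""
    return {unicodedata.name(c): unicodedata.name(c.lower()) for c in chars}

if __name__ == "__main__":
    for upper_name, lower_name in lowercase_names(LOOKALIKES).items():
        print(f"{upper_name} -> {lower_name}")
    # Normalization does not merge them either, even in its broadest form.
    print(all(unicodedata.normalize("NFKC", c) == c for c in LOOKALIKES))
```

One practical consequence of the separate code points is that case mapping stays a per-character lookup: the Greek capital lowercases to α, which no longer resembles Latin a.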

WalterBright 97 days ago

> So you think that the letters in the Greek and Cyrillic alphabets which are printed identically to the Latin A should not exist?

Yes. Unicode should not be about semantic meaning, it should be about the visual. Like text in a book.

> And, for example, Greek words containing this letter should be encoded with a mix of Latin and Greek characters?

Yup. Consider a printed book. How can you tell if a letter is a Greek letter or a Latin letter?

Those Unicode homonyms are a solution looking for a problem.

bawolff 97 days ago

> Yes. Unicode should not be about semantic meaning, it should be about the visual. Like text in a book.

Do you think 1, l and I should be encoded as the same character, or does this logic only extend to characters pesky foreigners use?

WalterBright 97 days ago

They are visually distinct to the reader.

debazel 97 days ago

That is entirely dependent on the font.

Yokohiii 97 days ago

Unicode is about semantics not appearance. If you don't need semantics then use something different.

WalterBright 97 days ago

> Unicode is about semantics not appearance.

And that's where it went off the rails into lala land. 'a' can have all kinds of distinct meanings. How are you going to make that work? It's hopeless.

Yokohiii 97 days ago

It already works.

Tell me what the problem is and what your proposed solution would be.

Muromec 97 days ago

>Yup. Consider a printed book. How can you tell if a letter is a Greek letter or a Latin letter?

I can absolutely tell Cyrillic k from the Latin к and Latin u from the Cyrillic и.

>should not be about semantic meaning,

It's always better to be able to preserve more information in a text and not less.

WalterBright 97 days ago

> I can absolutely tell Cyrillic k from the Latin к and Latin u from the Cyrillic и.

They look visually distinct to me. I don't get your point.

> It's always better to be able to preserve more information in a text and not less.

Text should not lose information by printing it and then OCR'ing it.

ted_dunning 97 days ago

But these characters only look identical in some fonts. Are you saying that if you change font, some characters in a string should change appearance and others should not?

And what about the round-trip rule?

And ligatures? Aren't those a semantic distinction?

WalterBright 97 days ago

> But these characters only look identical in some fonts.

That's a problem with the fonts.

> And what about the round-trip rule?

Print Unicode on paper, then OCR it, and you'll get different Unicode. Oh, and normalization.

> ligatures

Generally an issue with rendering.

> semantic distinction

Unicode isn't about semantics (or shouldn't be). Consider 'a'. It's used for all kinds of meanings.

Yokohiii 97 days ago

What about numbers? Would they be assigned to Arabic only? I guess someone will be offended by that.

While at it we could also unify I, | and l. It's too confusing sometimes.

WalterBright 97 days ago

> While at it we could also unify I, | and l. It's too confusing sometimes.

They render differently, so it's not a problem.

ted_dunning 97 days ago

They only render differently in some fonts, on some displays.

Yokohiii 97 days ago

totally not true :D

WalterBright 97 days ago

Look again at its rendering!

jeltz 97 days ago

I don't think that would help much. There are also characters which are similar but not the same, and I don't think humans can spot the differences unless they are actively looking for them, which most of the time people are not. If only one of two similar glyphs appears in the text, nobody would likely notice; expectation bias will fuck you over.

WalterBright 97 days ago

I wonder how anybody got by with printed books.

wcoenen 97 days ago

As far as I know, glyphs are determined by the font and rendering engine. They're not in the Unicode standard.

WalterBright 97 days ago

Fraktur (font) and italic (rendering) are in the Unicode standard, although Hacker News will not render them. (I suspect that the Hacker News software filters out the nuttier Unicode stuff.)
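The Fraktur and italic letters referred to here live in the Mathematical Alphanumeric Symbols block. A small sketch with Python's standard `unicodedata` module shows how the standard itself relates them to the plain letters; `fold_styled` is a hypothetical helper name.

```python
import unicodedata

FRAKTUR_A = "\U0001D504"  # MATHEMATICAL FRAKTUR CAPITAL A
ITALIC_A = "\U0001D434"   # MATHEMATICAL ITALIC CAPITAL A

def fold_styled(text: str) -> str:
    """Apply NFKC, which replaces compatibility characters such as the
    mathematical alphabets with their plain equivalents."""
    return unicodedata.normalize("NFKC", text)

if __name__ == "__main__":
    for ch in (FRAKTUR_A, ITALIC_A):
        # The decomposition field tags these as font variants of U+0041.
        print(unicodedata.name(ch), "|", unicodedata.decomposition(ch))
    print(fold_styled(FRAKTUR_A + ITALIC_A))
```

The `<font>` tag in the decomposition marks these as compatibility characters, which is why compatibility normalization folds them back to ordinary letters.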

ted_dunning 97 days ago

One of the ground rules of Unicode is the round trip rule. You have to be able to translate to and from Unicode without loss of information.

WalterBright 97 days ago

They threw that out the window with normalization.
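For readers unfamiliar with what normalization changes, here is a minimal sketch using Python's standard `unicodedata.normalize`. It only demonstrates the mechanics; whether this conflicts with the round-trip rule is the point under dispute in this thread. The constant and helper names are invented for the example.

```python
import unicodedata

PRECOMPOSED = "\u00E9"    # é as a single code point
DECOMPOSED = "e\u0301"    # e followed by COMBINING ACUTE ACCENT
ANGSTROM_SIGN = "\u212B"  # Å ANGSTROM SIGN
A_WITH_RING = "\u00C5"    # Å LATIN CAPITAL LETTER A WITH RING ABOVE

def code_points(text: str) -> list[str]:
    """List the code points of a string in U+XXXX notation."""
    return [f"U+{ord(c):04X}" for c in text]

if __name__ == "__main__":
    # Same appearance, different code point sequences, unequal as strings.
    print(PRECOMPOSED == DECOMPOSED)
    # NFC composes, NFD decomposes; both map the pair to one canonical form.
    print(code_points(unicodedata.normalize("NFC", DECOMPOSED)))
    print(code_points(unicodedata.normalize("NFD", PRECOMPOSED)))
    # After NFC the Angstrom sign becomes the ordinary letter, so the
    # original code point cannot be recovered from the normalized text.
    print(code_points(unicodedata.normalize("NFC", ANGSTROM_SIGN)))
```

Normalization is an explicit operation: a string that is stored and read back without calling it keeps its original code points.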

ted_dunning 97 days ago

No need to remove them. Just make them visible for applications that don't need to render every language. Make that behavior optional as well in case you really want to name characters with Hangul or Tibetan.

Some middle ground so that you can use Greek letters in Julia might be nice as well.

But I don't see any purpose in using the Private Use Areas (PUA) in programming.

abujazar 97 days ago

Invisible characters are there for visible characters to be printed correctly...

WalterBright 97 days ago

I'll grant that a space and a newline are necessary. The rest, nope.

abujazar 97 days ago

You're talking about a subset of ASCII then. Unicode is supposed to support different languages and advanced typography, for which those characters are necessary. You can't write e.g. Arabic or Hebrew without those "unnecessary" invisible characters.

WalterBright 97 days ago

Please explain why an invisible zero width "character" is necessary.

slim 97 days ago

If you write كلب, an Arabic word written right to left, in the middle of an English sentence, you want to preserve the order of the characters in the stream for computer processing purposes. That means the character ك must come before the ل and after the e and the space in the memory layout, whereas when displayed it must be inverted to be legible. The solution is to have an invisible character that indicates a switch in text direction. If you were wondering, writing foreign-language text within your own text is very common outside English-speaking countries.
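The logical-order storage described here can be inspected directly. The sketch below, using Python's standard `unicodedata` module, shows that the Arabic letters sit in memory in reading order and that each character carries a bidirectional class. The sample sentence and the `bidi_classes` helper are made up for illustration.

```python
import unicodedata

# كلب embedded in an English sentence, written with escapes so the source
# file itself contains no right-to-left text.
SENTENCE = "the word \u0643\u0644\u0628 means dog"

def bidi_classes(text: str) -> list[tuple[str, str]]:
    """Pair each code point with its Unicode bidirectional class."""
    return [(f"U+{ord(c):04X}", unicodedata.bidirectional(c)) for c in text]

if __name__ == "__main__":
    # Memory order follows reading order: ك, then ل, then ب.
    print([SENTENCE.index(c) for c in "\u0643\u0644\u0628"])
    for label, cls in bidi_classes(SENTENCE[7:13]):
        print(label, cls)
    # The invisible marks carry a class too: LRM is L, RLM is R.
    print(unicodedata.bidirectional("\u200E"), unicodedata.bidirectional("\u200F"))
```

As far as I understand the Unicode Bidirectional Algorithm, a renderer derives the display order from these classes, and the invisible marks are there for cases where the classes of the surrounding characters leave the direction ambiguous.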

WalterBright 97 days ago

Look I'm writing sdrawkcab (amazingly, I did it without using Unicode!). Layout is the job of your text formatting program. It's easy to fix a text editor to support right-to-left text entry.

The switch in text direction has resulted in malicious code injection attacks, as the reversed text becomes invisible. I had to change my compiler to reject those Unicode characters for that reason. It can be used in other cases to have hidden, malicious text.

Have you checked your SQL code for invisible backwards text that injects malware?
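A check of the kind described here can be sketched in a few lines of Python. This is not the compiler change mentioned above, only an illustration: the function name `find_bidi_controls` and the exact set of characters treated as suspicious are assumptions.

```python
import unicodedata

# Explicit bidirectional formatting characters: embeddings, overrides,
# isolates and their terminators. Flagging exactly this set is an
# assumption of this sketch.
BIDI_CONTROLS = frozenset(
    "\u202A\u202B\u202C\u202D\u202E"  # LRE, RLE, PDF, LRO, RLO
    "\u2066\u2067\u2068\u2069"        # LRI, RLI, FSI, PDI
)

def find_bidi_controls(source: str) -> list[tuple[int, int, str]]:
    """Return (line, column, character name) for each bidi control, 1-based."""
    hits = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for col_no, ch in enumerate(line, start=1):
            if ch in BIDI_CONTROLS:
                hits.append((line_no, col_no, unicodedata.name(ch)))
    return hits

if __name__ == "__main__":
    sample = "SELECT 1; -- \u202Ehidden\u202C\nSELECT 2;"
    for line_no, col_no, name in find_bidi_controls(sample):
        print(f"line {line_no}, column {col_no}: {name}")
```

Running something like this over SQL or source files before review reports the position of every directional control, so a reviewer can decide whether it belongs there.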

ted_dunning 97 days ago

To prevent ligatures from forming when you need that.

WalterBright 97 days ago

That's the job of a typesetting language.

krior 97 days ago

To mark line-wrapping breakpoints in strings.
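As a rough illustration of that use, the toy wrapper below treats U+200B ZERO WIDTH SPACE as the only permitted break point in a string. It is a deliberately simplified sketch, not a real line-breaking algorithm, and `wrap_at_zwsp` is a hypothetical name.

```python
ZWSP = "\u200B"  # ZERO WIDTH SPACE

def wrap_at_zwsp(text: str, width: int) -> list[str]:
    """Greedy wrap that breaks only where a ZWSP occurs and drops the ZWSP."""
    lines, current = [], ""
    for chunk in text.split(ZWSP):
        # Start a new line if appending this chunk would exceed the width.
        if current and len(current) + len(chunk) > width:
            lines.append(current)
            current = chunk
        else:
            current += chunk
    if current:
        lines.append(current)
    return lines

if __name__ == "__main__":
    path = ZWSP.join(["/usr", "/local", "/share", "/doc"])
    for line in wrap_at_zwsp(path, 10):
        print(line)
```

With a wide enough limit the string comes back on one line with nothing visible inserted, which is the property a visible space or hyphen cannot provide.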

WalterBright 97 days ago