HN Mirror


by btilly 2387 days ago

Unicode was originally designed to fit in 16 bits, and this is memorialized in Java APIs that make it easy to mess up.
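To make the 16-bit point concrete, here is a minimal sketch in Python rather than Java. Java's `String.length()` counts UTF-16 code units, so a character outside the original 16-bit range counts as two. Python counts code points, so the sketch emulates the Java view by encoding to UTF-16. The helper name `utf16_units` is made up for illustration.

```python
def utf16_units(text: str) -> int:
    """Number of 16-bit code units needed to store text as UTF-16.

    Hypothetical helper: this is the count that a UTF-16 based API
    (such as Java's String.length()) would report.
    """
    # utf-16-le writes no byte order mark, so every 2 bytes is one code unit.
    return len(text.encode("utf-16-le")) // 2


s = "a\U0001F600"  # 'a' followed by U+1F600, which does not fit in 16 bits

code_points = len(s)    # Python counts code points: 2
units = utf16_units(s)  # UTF-16 needs a surrogate pair for U+1F600: 3
```

Code that indexes or truncates by code unit can split the surrogate pair in the middle, which is the kind of mistake those APIs make easy.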

A Unicode character does not specify the glyph to draw. Han unification is the best-known, but not the only, source of this challenge.

Conversely, the glyph does not specify the Unicode character. Precomposed versus combining characters are one source of this challenge. The result is that a name can be entered into a database and then be unfindable by a search.
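A small sketch of the search problem, using Python's standard `unicodedata` module. The two strings render as the same glyph but compare unequal until both are normalized to the same form. NFC is used here as an assumption; any single form applied consistently on both sides would do. The function name `same_text` is illustrative only.

```python
import unicodedata

precomposed = "\u00e9"  # é as a single code point, U+00E9
combining = "e\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT


def same_text(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


naive_match = precomposed == combining                # False: different code points
normalized_match = same_text(precomposed, combining)  # True: same text after NFC
```

A database that stores one form and is queried with the other behaves like the naive comparison unless it normalizes on both write and lookup.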

This feature has also been a source of security holes. See https://appcheck-ng.com/unicode-normalization-vulnerabilitie... for an explanation of how.

You would think that you could avoid this by banning control and combining characters without losing anything. Indeed, at one point the authors of Go (who included the inventors of UTF-8) thought this. But there are whole languages (particularly from the Indian subcontinent) that cannot be written without combining characters.

There are also lots and lots of invisible characters. This has been used to "fingerprint" text. (Each person gets a different invisible signature. The forwarded email includes the signature.) That's an interesting feature but complicates matching text documents even more.

Need I go on? When I see Unicode, I know that there lie dragons that programmers don't necessarily expect.


maxerickson 2387 days ago

One of your points is that an encoding designed to handle languages has support for more than one kind of white space. Given that languages use more than one kind of white space, this is sort of a necessity.

Another one is that a standard designed to support all languages has a feature necessary for supporting some languages.

Those aren't inconsistencies, so do feel free to go on.


btilly 2386 days ago

"One of your points is that an encoding designed to handle languages has support for more than one kind of white space."

No. It is that there is more than one kind of invisible character. No language has invisible characters.

"Another one is that a standard designed to support all languages has a feature necessary for supporting some languages."

Not sure what point you are misreading here. But that was not among my points.


maxerickson 2386 days ago

You said "But there are whole languages (particularly from the Indian subcontinent) that cannot be written without combining characters."

I suppose I didn't consider that they could be written without combining characters given a different design.

As far as invisible characters, I'm not interested in arguing about it. English, as written, has all sorts of different structural uses of white space, it isn't all just style.


btilly 2386 days ago

"I suppose I didn't consider that they could be written without combining characters given a different design."

They could be.

Likewise, European languages can be written without precomposed characters. The fact that é can be written in multiple ways was my point.

"As far as invisible characters, I'm not interested in arguing about it. English, as written, has all sorts of different structural uses of white space, it isn't all just style."

You still don't understand. I am not talking about whitespace. I am talking about invisible zero-width characters that can be slipped into text with no sign that they are there. Characters like U+180E, U+200B, U+200C, U+200D, and U+FEFF. Not to mention that you can achieve the same thing with control characters, like the sequence U+200F U+200E. (The undetectability of the last one is language dependent.)

As I said, this can be used to invisibly sign a document. But I don't see any other particular point to having so many ways to accomplish what looks like nothing.
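For anyone who wants to see whether a piece of text carries such characters, here is a rough sketch in Python. The set contains only the code points named above and is not a complete list of invisible characters. The names `INVISIBLE`, `find_invisible`, and `strip_invisible` are made up for this example.

```python
# Only the code points named in this comment; an assumption, not a full list.
INVISIBLE = {
    "\u180e", "\u200b", "\u200c", "\u200d",
    "\u200e", "\u200f", "\ufeff",
}


def find_invisible(text: str) -> list[tuple[int, str]]:
    """Return (index, 'U+XXXX') for each listed invisible character in text."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if ch in INVISIBLE
    ]


def strip_invisible(text: str) -> str:
    """Drop the listed characters; whether that is safe depends on the text."""
    return "".join(ch for ch in text if ch not in INVISIBLE)


sample = "pass\u200bword"          # looks like "password" when rendered
hits = find_invisible(sample)      # one hit at index 4
cleaned = strip_invisible(sample)  # the visible text only
```

Comparing `sample == "password"` is false even though the two look identical on screen, which is the matching problem described earlier in the thread.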
