| HN Mirror

by targonca 2493 days ago

That's... a very bad idea.

First of all, graphemes are inherently not one-to-one with code points, e.g. Á = A + ´ (U+0041 followed by the combining acute accent U+0301). There's simply no Unicode encoding that will let you safely index into an array without paying attention to the meaning of the underlying code points. (And no, using NFC won't solve this either, because there are combinations for which there's no precomposed equivalent.)
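To make that concrete, here is a minimal Python sketch using only the standard `unicodedata` module. The variable names are made up for illustration, and `q` + combining tilde is just one example of a sequence with no precomposed form.

```python
import unicodedata

# 'A' followed by COMBINING ACUTE ACCENT: one grapheme, two code points.
decomposed = "A\u0301"
n_decomposed = len(decomposed)                             # code points
n_utf32_units = len(decomposed.encode("utf-32-le")) // 4   # UTF-32 code units

# NFC helps in this case because a precomposed 'Á' (U+00C1) exists.
composed = unicodedata.normalize("NFC", decomposed)
n_composed = len(composed)

# 'q' + COMBINING TILDE has no precomposed form, so NFC leaves it as two
# code points even though a reader sees a single character.
q_tilde = unicodedata.normalize("NFC", "q\u0303")
n_q_tilde = len(q_tilde)

print(n_decomposed, n_utf32_units, n_composed, n_q_tilde)
```

So even in UTF-32, index `i` gives you the i-th code point, not the i-th character a user would see.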

Secondly, general formatting info won't fit into 11 bits (italic, bold, underline, strikethrough - that's already 4 bits, and we haven't talked about color, font weights other than bold, etc.), so why bother baking in a limited, intentionally gimped version into your character encoding?
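For the bit budget, a rough sketch, assuming the 11 bits are what is left of a 32-bit unit once the code point is stored, and assuming color would be a 24-bit RGB value (both are assumptions for illustration, not something stated in this thread):

```python
# Largest Unicode code point and the number of bits needed to store it.
MAX_CODE_POINT = 0x10FFFF
bits_for_code_point = MAX_CODE_POINT.bit_length()

# Bits left over in a 32-bit unit.
spare_bits = 32 - bits_for_code_point

# One flag bit each for the four basic styles named above.
basic_flags = ["italic", "bold", "underline", "strikethrough"]
bits_after_flags = spare_bits - len(basic_flags)

# Assumed color representation: 8 bits per RGB channel.
rgb_bits = 3 * 8
color_fits = rgb_bits <= bits_after_flags

print(bits_for_code_point, spare_bits, bits_after_flags, color_fits)
```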

1 comments

naikrovek 2493 days ago

It doesn't have to be formatting...

It is not a "very bad idea"; it is "an idea you do not like." Those are different things.

The way you're describing UTF-32, it can't work at all, and it definitely does.

Trying to save space by using UTF-8 over UTF-32 seems like a very small gain to me, is all. UTF-32 is simpler, for text created in that encoding.

targonca 2492 days ago

There are tons of resources online about why UTF-32 doesn't make sense. I'm not gonna repeat them. Do your own research.

https://news.ycombinator.com/item?id=8195827

https://softwareengineering.stackexchange.com/questions/2361...

https://en.wikipedia.org/wiki/UTF-32#Analysis

http://utf8everywhere.org/#myths
