| HN Mirror


by asveikau 4040 days ago

Sure, go to 32 bits per character. But it cannot be said to be "simple" and will not allow you to make the assumption that 1 integer = 1 glyph.

Namely it won't save you from the following problems:

    * Precomposed vs multi-codepoint diacritics (Do you write á with
      one 32-bit char or with two? In Unicode the answer is both)

    * Variation selectors (see also Han unification)

    * Bidi, RTL and LTR embedding chars
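To make the first bullet concrete, here is a minimal Python sketch using only the standard `unicodedata` module. The variable names are illustrative and not from the comment above; the point is just that the same visible "á" can be one code point or two, and that the two forms compare unequal until they are normalized.

```python
import unicodedata

precomposed = "\u00e1"   # U+00E1 LATIN SMALL LETTER A WITH ACUTE
decomposed = "a\u0301"   # "a" followed by U+0301 COMBINING ACUTE ACCENT

# Python strings are sequences of code points, so len() counts code points.
print(len(precomposed), len(decomposed))   # 1 2
print(precomposed == decomposed)           # False

# Normalizing both to the same form makes them comparable.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", precomposed)
print(nfc == precomposed, nfd == decomposed)   # True True
```

Even with 32 bits per code point, both spellings are valid, so equality and length still depend on normalization.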

And possibly others I don't know about. I feel like I am learning of these dragons all the time.

I almost like that UTF-16, and even more so UTF-8, break the "1 character, 1 glyph" rule, because it gets you in the mindset that this rule is bogus. And in Unicode it is most decidedly bogus, even if you switch to UCS-4 in a vain attempt to avoid such problems. Unicode just isn't simple any way you slice it, so you might as well shove the complexity in everybody's face and have them confront it early.


cygx 4040 days ago

If you use a 32-bit scheme, you can dynamically assign multi-character (extended) grapheme clusters to unused code units to get a fixed-width encoding.

Perl6 calls this NFG [1].

[1] http://design.perl6.org/S15.html

^ link currently broken, the plain-text version is at https://raw.githubusercontent.com/perl6/specs/master/S15-uni...
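A rough sketch of the idea, not the Perl6 implementation: the class below (hypothetical name `ToyNFG`) normalizes to NFC, groups each base character with the combining marks that follow it, and hands out a fresh negative integer for every cluster that still needs more than one code point. Treating "base plus combining marks" as a cluster is a simplifying assumption; real extended grapheme cluster segmentation follows more rules than this.

```python
import unicodedata

class ToyNFG:
    """Toy fixed-width encoding: one int per (simplified) grapheme cluster."""

    def __init__(self):
        self.cluster_to_id = {}   # multi-codepoint cluster -> negative int
        self.id_to_cluster = {}   # negative int -> multi-codepoint cluster

    def encode(self, text):
        text = unicodedata.normalize("NFC", text)
        clusters = []
        for ch in text:
            # Assumption: a combining mark attaches to the preceding cluster.
            if clusters and unicodedata.combining(ch):
                clusters[-1] += ch
            else:
                clusters.append(ch)
        units = []
        for cluster in clusters:
            if len(cluster) == 1:
                units.append(ord(cluster))          # ordinary code point
            else:
                if cluster not in self.cluster_to_id:
                    new_id = -(len(self.cluster_to_id) + 1)
                    self.cluster_to_id[cluster] = new_id
                    self.id_to_cluster[new_id] = cluster
                units.append(self.cluster_to_id[cluster])
        return units

    def decode(self, units):
        return "".join(chr(u) if u >= 0 else self.id_to_cluster[u] for u in units)

table = ToyNFG()
units = table.encode("a\u0301q\u0301")   # "á" composes under NFC, "q́" does not
print(units)                             # [225, -1]
print(table.decode(units) == unicodedata.normalize("NFC", "a\u0301q\u0301"))  # True
```

The negative numbers depend on the order in which clusters were first seen, so they only mean something together with the table that issued them.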


lmm 4040 days ago

You can't use that for storage.

> The mapping between negative numbers and graphemes in this form is not guaranteed constant, even between strings in the same process.


cygx 4040 days ago

What's your storage requirement that's not adequately solved by the existing encoding schemes?


lmm 4039 days ago

What are you suggesting? Storing strings in UTF-8 and then "normalizing" them into this bizarre format whenever you load/save them, purely so that offsets correspond to grapheme clusters? That doesn't seem worth the overhead to my eyes.


cygx 4039 days ago

In-memory string representation rarely corresponds to on-disk representation.

Various programming languages (Java, C#, Objective-C, JavaScript, ...) as well as some well-known libraries (ICU, Windows API, Qt) use UTF-16 internally. How much data do you have lying around that's UTF-16?

Sure, more recently, Go and Rust have decided to go with UTF-8, but that's far from common, and it does have some drawbacks compared to the Perl6 (NFG) or Python3 (latin-1, UCS-2, UCS-4 as appropriate) model if you have to do actual processing instead of just passing opaque strings around.
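The Python3 model mentioned here can be observed from the outside, at least on CPython 3.3 or later. The sketch below is an illustration under that assumption: `sys.getsizeof` is implementation-specific, and the exact byte counts vary by version and platform, so only the relative growth matters.

```python
import sys

# Three strings of equal length whose widest code point needs
# 1, 2, and 4 bytes respectively in a fixed-width representation.
samples = {
    "latin-1": "a" * 1000,
    "UCS-2": "\u0100" * 1000,        # U+0100, outside latin-1
    "UCS-4": "\U0001F600" * 1000,    # U+1F600, outside the BMP
}

sizes = {label: sys.getsizeof(s) for label, s in samples.items()}
for label, s in samples.items():
    print(label, len(s), sizes[label])
```

All three strings report `len() == 1000`, while the memory footprint grows with the widest code point present. Indexing stays fixed-width per code point, not per grapheme.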

Also note that you have to go through a normalization step anyway if you don't want to be tripped up by having multiple ways to represent a single grapheme.


raiph 4039 days ago

NFG enables O(N) algorithms for character-level operations.

The overhead is entirely wasted on code that does no character-level operations.

For code that does do some character-level operations, avoiding quadratic behavior may pay off handsomely.
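One way to see where the quadratic behavior comes from: in a variable-width encoding, finding the i-th character means scanning from the start. The helper below (hypothetical name `utf8_index`) does this for code points in UTF-8; it is a sketch of the cost model, not of how any particular runtime is implemented, and it works at the code point level, not the grapheme level.

```python
def utf8_index(data: bytes, i: int) -> str:
    """Return the i-th code point of UTF-8 data by scanning: O(i) per lookup."""
    count = -1
    start = None
    for pos, byte in enumerate(data):
        if byte & 0xC0 != 0x80:          # not a continuation byte: a new code point starts
            count += 1
            if count == i:
                start = pos
            elif count == i + 1:
                return data[start:pos].decode("utf-8")
    if start is None:
        raise IndexError(i)
    return data[start:].decode("utf-8")  # the i-th code point was the last one

data = "aé€😀".encode("utf-8")            # 1-, 2-, 3-, and 4-byte sequences
print(utf8_index(data, 2))               # €

# Fixed-width storage: the same lookup is a single array index.
fixed = [ord(c) for c in "aé€😀"]
print(chr(fixed[2]))                     # €
```

A loop that calls `utf8_index` for every position does O(N) work N times, which is the quadratic case; with one integer per character, the same loop is O(N) overall.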
