by ajross 4040 days ago

And as the linked article explains, UTF-16 is a huge mess of complexity with back-dated validation rules that had to be added because it stopped being a wide-character encoding when the new code points were added. UTF-16, when implemented correctly, is actually significantly more complicated to get right than UTF-8.

UTF-32/UCS-4 is quite simple, though obviously it imposes a 4x penalty on bytes used (relative to ASCII text in UTF-8). I don't know of anything that uses it in practice, though surely something does.
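To put rough numbers on the size penalty, here is a small sketch that encodes a few sample strings with Python's built-in codecs. The sample strings and the `encoded_sizes` helper are made up for illustration; the little-endian codec variants are used so that no byte-order mark is counted.

```python
# Compare the encoded size of the same text in UTF-8, UTF-16 and UTF-32.
# The "-le" codecs are used so that no byte-order mark inflates the count.
ENCODINGS = ("utf-8", "utf-16-le", "utf-32-le")

# Illustrative sample strings (not from the thread).
samples = {
    "ascii": "hello",
    "accented": "héllo",
    "cjk": "日本語",
    "emoji": "😀",
}


def encoded_sizes(text):
    """Return a dict mapping each encoding name to the byte length of text."""
    return {enc: len(text.encode(enc)) for enc in ENCODINGS}


for name, text in samples.items():
    print(name, encoded_sizes(text))
```

The full 4x factor only shows up for pure ASCII text. For code points outside the BMP, all three encodings need four bytes each.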

Again: wide characters are a hugely flawed idea.

2 comments

asveikau 4040 days ago

Sure, go to 32 bits per character. But it cannot be said to be "simple" and will not allow you to make the assumption that 1 integer = 1 glyph.

Namely it won't save you from the following problems:

    * Precomposed vs multi-codepoint diacritics (Do you write á with
      one 32 bit char or with two? If it's Unicode the answer is both)

    * Variation selectors (see also Han unification)

    * Bidi, RTL and LTR embedding chars

And possibly others I don't know about. I feel like I am learning of these dragons all the time.
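The first item in the list is easy to reproduce. A minimal sketch using Python's standard `unicodedata` module (the variable names are just for illustration):

```python
import unicodedata

precomposed = "\u00e1"   # á as one code point (LATIN SMALL LETTER A WITH ACUTE)
decomposed = "a\u0301"   # 'a' followed by COMBINING ACUTE ACCENT

# Both render as the same glyph, but the code point counts differ,
# and a naive comparison treats them as different strings.
print(len(precomposed), len(decomposed))
print(precomposed == decomposed)

# Normalizing both sides to the same form makes them comparable.
nfc = unicodedata.normalize("NFC", decomposed)   # compose where possible
nfd = unicodedata.normalize("NFD", precomposed)  # decompose fully
print(nfc == precomposed, nfd == decomposed)

# Even in UTF-32, the decomposed form takes two 32-bit units.
print(len(decomposed.encode("utf-32-le")) // 4)
```

So a 32-bit encoding fixes the width of a code point, not the width of what a user would call a character.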

I almost like that utf-16 and more so utf-8 break the "1 character, 1 glyph" rule, because it gets you in the mindset that this is bogus. Because in Unicode it is most decidedly bogus, even if you switch to UCS-4 in a vain attempt to avoid such problems. Unicode just isn't simple any way you slice it, so you might as well shove the complexity in everybody's face and have them confront it early.

cygx 4040 days ago

If you use a 32-bit scheme, you can dynamically assign multi-character (extended) grapheme clusters to unused code units to get a fixed-width encoding.

Perl6 calls this NFG [1].

[1] http://design.perl6.org/S15.html

^ link currently broken, the plain-text version is at https://raw.githubusercontent.com/perl6/specs/master/S15-uni...
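A rough sketch of the idea, not Perl 6's actual implementation. It assumes a heavily simplified notion of a grapheme cluster (a code point plus any following code points with a nonzero canonical combining class), which is much narrower than real extended grapheme clusters. `SyntheticTable` and `to_fixed_width` are hypothetical names.

```python
import unicodedata


class SyntheticTable:
    """Hands out negative integers for multi-code-point clusters (hypothetical)."""

    def __init__(self):
        self._ids = {}        # cluster string -> negative integer
        self._clusters = []   # position (-id - 1) -> cluster string

    def intern(self, cluster):
        """Return the code point itself, or a synthetic negative id."""
        if len(cluster) == 1:
            return ord(cluster)
        if cluster not in self._ids:
            self._clusters.append(cluster)
            self._ids[cluster] = -len(self._clusters)
        return self._ids[cluster]

    def lookup(self, unit):
        """Map a stored unit back to the text it stands for."""
        return chr(unit) if unit >= 0 else self._clusters[-unit - 1]


def to_fixed_width(text, table):
    """Return one integer per (simplified) grapheme cluster."""
    text = unicodedata.normalize("NFC", text)
    units, cluster = [], ""
    for ch in text:
        # ASSUMPTION: a new cluster starts at every non-combining code point.
        if cluster and unicodedata.combining(ch) == 0:
            units.append(table.intern(cluster))
            cluster = ""
        cluster += ch
    if cluster:
        units.append(table.intern(cluster))
    return units


table = SyntheticTable()
units = to_fixed_width("x\u0301a\u0301b", table)
print(units, [table.lookup(u) for u in units])
```

Here "x" plus a combining acute has no precomposed form, so it gets a synthetic negative id, while "a" plus a combining acute composes to a single code point under NFC. Indexing into `units` is then by cluster, but the ids only mean something together with the table that issued them.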

lmm 4040 days ago

You can't use that for storage.

> The mapping between negative numbers and graphemes in this form is not guaranteed constant, even between strings in the same process.

cygx 4040 days ago

What's your storage requirement that's not adequately solved by the existing encoding schemes?

lmm 4039 days ago

What are you suggesting, store strings in UTF8 and then "normalize" them into this bizarre format whenever you load/save them purely so that offsets correspond to grapheme clusters? Doesn't seem worth the overhead to my eyes.

cygx 4039 days ago

In-memory string representation rarely corresponds to on-disk representation.

Various programming languages (Java, C#, Objective-C, JavaScript, ...) as well as some well-known libraries (ICU, Windows API, Qt) use UTF-16 internally. How much data do you have lying around that's UTF-16?

Sure, more recently, Go and Rust have decided to go with UTF-8, but that's far from common, and it does have some drawbacks compared to the Perl6 (NFG) or Python3 (latin-1, UCS-2, UCS-4 as appropriate) model if you have to do actual processing instead of just passing opaque strings around.
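The Python 3 model mentioned here can be observed indirectly. The sketch below assumes CPython 3.3 or later, where the string's internal width depends on its widest code point; `sys.getsizeof` reports implementation-specific sizes, so other interpreters may behave differently. `bytes_per_char` is a hypothetical helper.

```python
import sys


def bytes_per_char(ch, n=1000):
    """Estimate storage per character by growing a string of one repeated char.

    Subtracting the size of the one-character string cancels the fixed
    object header, leaving only the per-character cost.
    """
    return (sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch)) // n


for label, ch in [
    ("ASCII", "a"),
    ("Latin-1", "\u00e9"),
    ("BMP", "\u0100"),
    ("astral", "\U00010000"),
]:
    print(label, bytes_per_char(ch))
```

Because every character in a given string has the same width, indexing by code point stays O(1), at the cost of widening the whole string when a single wide character appears.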

Also note that you have to go through a normalization step anyway if you don't want to be tripped up by having multiple ways to represent a single grapheme.

raiph 4039 days ago

NFG enables O(N) algorithms for character level operations.

The overhead is entirely wasted on code that does no character level operations.

For code that does do some character level operations, avoiding quadratic behavior may pay off handsomely.

jheriko 4040 days ago

i think linux/mac systems default to UCS-4, certainly the libc implementations of wcs* do.

i agree it's a flawed idea though. 4 billion characters seems like enough for now, but i'd guess UTF-32 will need extending to 64 too... and actually how about decoupling the size from the data entirely? it works well enough in the general case of /every type of data we know about/ that i'm pretty sure this specialised use case is not very special.

ajross 4040 days ago

The Unixish C runtimes of the world use a 4-byte wchar_t. I'm not aware of anything in "Linux" that actually stores or operates on 4-byte character strings. Obviously some software somewhere must, but the overwhelming majority of text processing on your linux box is done in UTF-8.

That's not remotely comparable to the situation in Windows, where file names are stored on disk in a 16 bit not-quite-wide-character encoding, etc... And it's leaked into firmware. GPT partition names and UEFI variables are 16 bit despite never once being used to store anything but ASCII, etc... All that software is, broadly, incompatible and buggy (and of questionable security) when faced with new code points.

CUViper 4040 days ago

We don't even have 4 billion characters possible now. The Unicode range is only 0-10FFFF, and UTF-16 can't represent any more than that. So UTF-32 is restricted to that range too, despite what 32 bits would allow, never mind 64.
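The 10FFFF ceiling falls straight out of the surrogate pair arithmetic: each half of a pair carries 10 bits, added on top of the 0x10000 offset. A minimal sketch (the function names are illustrative):

```python
def to_surrogates(cp):
    """Split a supplementary code point into a UTF-16 (high, low) pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    v = cp - 0x10000                      # 20-bit value
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)


def from_surrogates(high, low):
    """Combine a (high, low) surrogate pair back into a code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# The largest possible pair gives the top of the Unicode range.
highest = from_surrogates(0xDBFF, 0xDFFF)
print(hex(highest))

# Cross-check against Python's own UTF-16 encoder.
high, low = to_surrogates(0x1F600)
print(hex(high), hex(low), "\U0001F600".encode("utf-16-be").hex())
```

With 2 x 10 bits there is simply no pair that decodes to anything above 0x10FFFF, which is why the other encodings were capped to match.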

But we don't seem to be running out -- Planes 3-13 are completely unassigned so far, covering 30000-DFFFF. That's nearly 65% of the Unicode range completely untouched, and planes 1, 2, and 14 still have big gaps too.

vorg 4040 days ago

> But we don't seem to be running out

The issue isn't the quantity of unassigned codepoints, it's how many private use ones are available, only 137,000 of them. Publicly available private use schemes such as ConScript are fast filling up this space, mainly by encoding block characters in the same way Unicode encodes Korean Hangul, i.e. by using a formula over a small set of base components to generate all the block characters.
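For reference, the "137,000" figure can be checked from the three private use ranges: one in the BMP, plus planes 15 and 16 minus their two trailing noncharacters each. A quick sketch:

```python
# Private use ranges as (first, last) code points, inclusive.
PRIVATE_USE_RANGES = [
    (0xE000, 0xF8FF),       # BMP Private Use Area
    (0xF0000, 0xFFFFD),     # Supplementary Private Use Area-A (plane 15)
    (0x100000, 0x10FFFD),   # Supplementary Private Use Area-B (plane 16)
]

sizes = [last - first + 1 for first, last in PRIVATE_USE_RANGES]
total_private_use = sum(sizes)
print(sizes, total_private_use)
```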

My own surrogate scheme, UTF-88, implemented in Go at https://github.com/gavingroovygrover/utf88 , expands the number of UTF-8 codepoints to 2 billion as originally specified by using the top 75% of the private use codepoints as 2nd tier surrogates. This scheme can easily be fitted on top of UTF-16 instead. I've taken the liberty in this scheme of making 16 planes (0x10 to 0x1F) available as private use; the rest are unassigned.

I created this scheme to help in using a formulaic method to generate a commonly used subset of the CJK characters, perhaps in the codepoints which would be 6 bytes under UTF-8. It would be more difficult than the Hangul scheme because CJK characters are built recursively. If successful, I'd look at pitching the UTF-88 surrogation scheme for UTF-16 and having UTF-8 and UTF-32 officially extended to 2 billion characters.

raiph 4039 days ago

What do you make of NFG, as mentioned in another comment below?

vorg 4037 days ago

NFG uses the negative numbers down to about -2 billion as an implementation-internal private use area to temporarily store graphemes. This enables fast grapheme-based manipulation of strings in Perl 6. Such negative-numbered codepoints could only be used for private use in data interchange between 3rd parties if UTF-32 were used, though, because neither UTF-8 (even pre-2003) nor UTF-16 could encode them.

raiph 4036 days ago

Thanks.

cpeterso 4040 days ago

Yes. sizeof(wchar_t) is 2 on Windows and 4 on Unix-like systems, so wchar_t is pretty much useless. That's why C11 added char16_t and char32_t.
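The platform difference is easy to check without writing any C. This sketch uses Python's standard `ctypes` module, whose `c_wchar` type corresponds to the platform's `wchar_t`; the variable names are illustrative.

```python
import ctypes

# Size in bytes of the platform's wchar_t as seen through ctypes.
wchar_size = ctypes.sizeof(ctypes.c_wchar)
print("sizeof(wchar_t) =", wchar_size)

# A code point outside the BMP fits in one 4-byte unit but needs a
# surrogate pair where the unit is only 2 bytes wide.
astral = "\U0001F600"
units_needed = len(astral.encode("utf-16-le")) // 2 if wchar_size == 2 else 1
print("wchar_t units for U+1F600:", units_needed)
```

Code that sizes buffers or indexes by `wchar_t` therefore behaves differently across platforms for the same text.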

colomon 4040 days ago

I'm wondering how common the "mistake" of storing UTF-16 values in wchar_t on Unix-like systems is. I thought I had my code carefully choosing between UTF-16 and UTF-32 based on the size of wchar_t, only to discover that one of the supposedly portable libraries I used had UTF-16 no matter how big wchar_t was.

clort 4039 days ago

Unix-like systems except for MirBSD, which uses a 16-bit wchar_t