by frign 655 days ago
Author of libgrapheme here: Both collation and normalisation are non-trivial and have many gotchas thanks to the way the Unicode consortium likes to write their specifications. I sometimes get the feeling that they don't even care about implementers and just document what is done in the reference implementation ICU.

The only sensible normalisation one can implement is the full decomposition (NFD), and maybe the full composition (NFC). You rely on the full decomposition if you want to collate correctly, which is a problem because the amount of memory needed to store the decomposition is unbounded in general. I don't want to make the libgrapheme users jump through hoops, and I don't want to do any memory allocations in libgrapheme either.
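To make the memory point concrete, here is a minimal sketch using Python's standard `unicodedata` module (not libgrapheme, whose API is not shown here). It assumes one reading of "unbounded": a run of combining marks can be arbitrarily long, and NFD has to see the whole run before it can put the marks into canonical order. The variable names and the run length of 1000 are made up for illustration.

```python
import unicodedata

# The same visible "é", once precomposed and once as base letter + combining mark.
composed = "\u00e9"        # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"     # "e" + COMBINING ACUTE ACCENT

nfd = unicodedata.normalize("NFD", composed)    # splits into base + mark
nfc = unicodedata.normalize("NFC", decomposed)  # recombines into one code point

# NFD also reorders combining marks by canonical combining class (ccc).
# U+0301 has ccc 230 and U+0323 has ccc 220, so U+0323 must end up first.
ACUTE, DOT_BELOW = "\u0301", "\u0323"
assert unicodedata.combining(ACUTE) == 230
assert unicodedata.combining(DOT_BELOW) == 220

# Hypothetical worst case: a long run of marks followed by one that sorts earlier.
# The trailing U+0323 has to move in front of all 1000 U+0301, so nothing after
# the base letter can be emitted until the end of the run has been seen.
run_length = 1000
text = "a" + ACUTE * run_length + DOT_BELOW
normalised = unicodedata.normalize("NFD", text)
```

Since `run_length` can be anything, a normaliser working in a fixed-size buffer cannot handle every input this way, which is presumably the tension with an allocation-free design.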

There is an idea floating in my head on how to solve this, but I'm currently busy finalising Unicode 15.1 support (Unicode 16.0, released on the 10th, will be trivial to upgrade to) and releasing my already fully-compliant implementation of the Unicode bidirectional algorithm.

1 comment

I see, thanks for replying. I agree, Unicode specs are hell to work with (I tried doing an auto-codegen thing based on them and just gave up due to the size of the generated tables and the seemingly arbitrary edge cases). libgrapheme looks pretty good otherwise, I'll keep an eye on it for whenever I have to wrangle with Unicode on a low level again (hopefully not for a long time).