

by joshdata 110 days ago

> If your application also runs NFKC normalization (which it should — ENS, GitHub, and Unicode IDNA all require it)

That's not right. Most of the web requires NFC normalization, not NFKC. NFC doesn't lose information in the original string. It reorders and combines code points into equivalent code point sequences, e.g. to simplify equality tests.

In NFKC, the K for "Compatibility" means some characters are replaced with similar, simpler code points. I've found NFKC useful for making text search indexes where you want matches to be forgiving, but it would be both obvious and wrong to use it in most of the web because it would dramatically change what the user has entered. See the examples in https://www.unicode.org/reports/tr15/.
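To make the difference concrete, here is a minimal sketch using Python's standard `unicodedata` module. The helper name `compare_forms` and the sample strings are illustrative choices of mine, not taken from the article or from TR15.

```python
import unicodedata


def compare_forms(text: str) -> tuple[str, str]:
    """Return (NFC, NFKC) for text. Hypothetical helper for illustration."""
    return (
        unicodedata.normalize("NFC", text),
        unicodedata.normalize("NFKC", text),
    )


samples = [
    "e\u0301",   # 'e' followed by COMBINING ACUTE ACCENT
    "\ufb01",    # LATIN SMALL LIGATURE FI
    "\u2460",    # CIRCLED DIGIT ONE
    "x\u00b2",   # 'x' followed by SUPERSCRIPT TWO
]

for sample in samples:
    nfc, nfkc = compare_forms(sample)
    # !a prints non-ASCII characters as escapes so the code points are visible.
    print(f"input={sample!a} NFC={nfc!a} NFKC={nfkc!a}")
```

In this sketch NFC only composes the first sample into the single code point U+00E9 and leaves the ligature, the circled digit, and the superscript as they were. NFKC rewrites those three to plain `fi`, `1`, and `x2`, which is the kind of change a user would notice in text they typed.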

3 comments

ZoneZealot 110 days ago

I think we're expecting too much from an LLM-generated article by a user who has been spending a lot of time spamming their content across multiple platforms and websites.


paultendo 110 days ago

Thanks, Josh. Putting this article out there has pushed me to sharpen a lot of my thinking, which I hope comes across in my more recent work. I've updated the article to scope the NFKC recommendation to identifiers and added a note crediting your correction. Thanks for catching it.


bawolff 110 days ago

I feel like for search, applying NFKD and then removing all the combining characters would be a better bet than NFKC.

Of course, there are also purpose-specific algorithms for preparing text for search that would be even better.
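The NFKD-then-strip idea could be sketched roughly as follows, again with Python's standard `unicodedata` module. The function name `search_key` is hypothetical, and treating every code point in a Unicode "Mark" category (Mn, Mc, Me) as a combining character is my assumption about what "combining characters" should cover.

```python
import unicodedata


def search_key(text: str) -> str:
    """Build a forgiving search key (hypothetical helper, not a standard API).

    1. NFKD splits characters into base characters plus combining marks
       and replaces compatibility characters with simpler equivalents.
    2. Every code point whose general category starts with "M" (a mark)
       is then dropped. This is an assumption about scope.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(
        ch for ch in decomposed
        if not unicodedata.category(ch).startswith("M")
    )


print(search_key("Caf\u00e9"))                       # accent removed
print(unicodedata.normalize("NFKC", "Caf\u00e9"))    # accent kept
```

With this sketch, a query for `Cafe` would match an indexed `Café`, which NFKC alone would not give you because NFKC recomposes the accent. Stripping all marks is lossy for scripts where marks carry essential meaning, so this probably only suits a search index and not stored user text.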
