
by jrochkind1 3934 days ago

Some of what you find by googling is just wrong. Dealing with global characters is confusing; people get it wrong a lot and suggest wrong answers.

But it's true, as far as I know, that there's no Unicode-standard way to 'strip accents', which is unfortunate because we sometimes do need to do it. Even if 'strip accents' is locale dependent, and may have no sensible answer in some locales, I think there are sensible ways to do it in some locales (certainly in English, for Latin characters at least). I wish there were a recognized best-practice standard for doing it that could be implemented identically in various languages (maybe there is and I don't know it?).

There are Unicode-standard ways to compare and sort strings while ignoring accents, in at least some locales (the Unicode Collation Algorithm, compared at primary strength), which might get you there if you reverse engineered them and took them further.

At any rate, you can't simply talk about 'Unicode normalization' without talking about the four different Unicode normalization forms: NFC and NFD (canonical, composed and decomposed) and NFKC and NFKD (compatibility, composed and decomposed). If you do, you are definitely getting something wrong.
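To make the difference between the four forms concrete, here's a small sketch using Python's standard `unicodedata` module. The sample string is just something I picked for illustration: a precomposed 'é' followed by the 'ﬁ' ligature, which has a compatibility decomposition but no canonical one.

```python
import unicodedata

# Illustrative sample (my choice): precomposed e-acute (U+00E9)
# followed by the "fi" ligature (U+FB01).
s = "\u00e9\ufb01"

# Apply each of the four normalization forms to the same input.
forms = {
    name: unicodedata.normalize(name, s)
    for name in ("NFC", "NFD", "NFKC", "NFKD")
}

# Print the code points so the differences are visible even when
# the rendered glyphs look identical.
for name, value in forms.items():
    print(name, [f"U+{ord(ch):04X}" for ch in value])
```

The canonical forms (NFC, NFD) leave the ligature alone and only differ in whether the accent is a separate combining mark. The compatibility forms (NFKC, NFKD) also replace the ligature with plain 'f' and 'i'. Same input, four different results, so "just normalize it" is not a complete answer.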

Also, Unicode normalization forms are definitely _not_ intended to 'strip accents'. That is not what they are for, and they aren't the solution to that, even if the compatibility normalizations do it in some cases.
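For reference, the recipe you usually find when googling is "decompose, then throw away the combining marks". Here's a sketch of it in Python, mostly to show where it falls short. The function name `strip_combining_marks` is my own, and this is a common workaround, not anything the Unicode standard defines as accent stripping.

```python
import unicodedata

def strip_combining_marks(text: str) -> str:
    """Decompose canonically (NFD), then drop nonspacing marks (category Mn).

    A common workaround, not a Unicode-defined operation.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

# Works when the accented letter decomposes into base letter + combining mark.
print(strip_combining_marks("caf\u00e9"))

# Does nothing for letters with no decomposition, such as U+00D8 (O with stroke).
print(strip_combining_marks("\u00d8re"))

# Mixed result: U+0141 (L with stroke) is kept, the other accents are removed.
print(strip_combining_marks("\u0141\u00f3d\u017a"))
```

Letters like 'Ø' and 'Ł' are single characters with no decomposition, so normalization never separates a mark you could drop. Whether they should become 'O' and 'L' at all is exactly the kind of locale-dependent question that normalization doesn't try to answer.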