Hacker News
by eevee 3936 days ago
OP here. You say this all, and yet, if I google for "unicode strip accents"...

Top-voted answer uses NFD, one below it uses NFKD: http://stackoverflow.com/questions/517923/what-is-the-best-w...

NFKD: http://www.perlmonks.org/?node_id=835238

NFD: http://www.perlmonks.org/?node_id=1105025

NFD: http://www.perlmonks.org/?node_id=485681

NFD: http://drillio.com/en/software/java/remove-accent-diacritic/

NFKD: https://gist.github.com/j4mie/557354

Two and a half of the first six results blindly apply NFKD to arbitrary text. All of them use normalization.
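
For reference, the recipe most of those results share looks roughly like the sketch below: decompose, then drop the nonspacing marks. This is an illustration, not code taken from any of the linked answers; `strip_marks` is a hypothetical helper name and the sample strings are my own.

```python
import unicodedata


def strip_marks(text, form="NFD"):
    """Decompose `text` with the given normalization form, then drop
    nonspacing marks (general category Mn). Hypothetical helper."""
    decomposed = unicodedata.normalize(form, text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


# With NFD, only canonically decomposable characters are affected.
print(strip_marks("caf\u00e9"))           # cafe
print(strip_marks("x\u00b2"))             # x² (superscript two is untouched)
print(strip_marks("\u00f8"))              # ø (no decomposition, so nothing is stripped)

# With NFKD, compatibility characters are rewritten too, accents or not.
print(strip_marks("x\u00b2", "NFKD"))     # x2
print(strip_marks("\ufb01", "NFKD"))      # fi (the ligature becomes two letters)
```

The NFKD variant is the one that bites on arbitrary text: it changes characters that never had an accent in the first place.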

Sad state of affairs.

2 comments

"Strip accents" is not a well-defined operation outside of a specific locale. Does "Ö" have an accent or not? In German, yes: it's an O with an umlaut. In English, yes: it's an O with some funny dots on it (heavy metal umlauts?). In the "New Yorker" dialect of English, it's an O with a dieresis. But in Hungarian, Finnish, Turkish, and many others, it's not: it's the letter between O and P, or between O and Ő, or after Z, or...

If you do want to do this, you should know that it only makes sense in your own locale, and you shouldn't be surprised that the methods are somewhat ad-hoc (I'm not saying you shouldn't do this: I've done it myself).

In German, the history of the letter and the spelling rules even dictate that ö should be written as oe in such cases (that's what it evolved from, and that's what the two dots are; i.e., it's not a diaeresis in German, despite looking the same).
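
A minimal sketch of what that locale-specific rule could look like in Python, as opposed to a generic "drop the dots" pass. The function name `german_fold` is made up for illustration, and extending the ö → oe rule to ä and ü is my assumption; other German conventions (ß, for instance) are deliberately left out.

```python
import unicodedata

# Assumed mapping: only ö -> oe comes from the comment above;
# ä/ü and the uppercase variants are added by analogy.
GERMAN_UMLAUTS = str.maketrans({
    "\u00e4": "ae", "\u00f6": "oe", "\u00fc": "ue",
    "\u00c4": "Ae", "\u00d6": "Oe", "\u00dc": "Ue",
})


def german_fold(text):
    """Transliterate German umlauts to their two-letter spellings.
    Composes first (NFC) so 'o' + combining diaeresis is also caught."""
    return unicodedata.normalize("NFC", text).translate(GERMAN_UMLAUTS)


print(german_fold("sch\u00f6n"))    # schoen
print(german_fold("scho\u0308n"))   # schoen (decomposed input)
```

Applying this to Finnish or Turkish text would of course be wrong, which is the point: the mapping only makes sense for one locale.
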

Some of what you find googling is just wrong. Dealing with global characters is confusing; people get it wrong a lot and suggest wrong answers.

But it's true, as far as I know, that there's no Unicode-standard way to 'strip accents', which is unfortunate because we sometimes do need to do it. Even if 'strip accents' is locale-dependent, and may have no sensible answer in some locales, I think there are sensible ways to do it in some locales (certainly in English, for Latin characters at least), and I wish there were a recognized best-practice standard for doing it that could be implemented identically in various languages (maybe there is and I don't know it?).

There are Unicode-standard ways to compare/sort strings ignoring accents, in at least some locales, which might get you there if you reverse-engineered them and took them further.

At any rate, you can't simply talk about 'Unicode normalization' without talking about the four different Unicode normalization forms (canonical and compatibility; decomposed and composed). If you do, you are definitely getting something wrong.
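
To make the four forms concrete, here is a small sketch using Python's `unicodedata` module. The sample string (an "fi" ligature followed by "ancé") is my own choice, picked because it contains one compatibility character and one accented letter.

```python
import unicodedata

# U+FB01 (fi ligature) + "anc" + U+00E9 (e with acute): 5 code points.
sample = "\ufb01anc\u00e9"

for form in ("NFC", "NFD", "NFKC", "NFKD"):
    result = unicodedata.normalize(form, sample)
    codepoints = " ".join(f"U+{ord(ch):04X}" for ch in result)
    print(f"{form:4} len={len(result)}  {codepoints}")

# NFC  keeps the ligature and the composed é             (5 code points)
# NFD  keeps the ligature, splits é into e + U+0301      (6 code points)
# NFKC replaces the ligature with "fi", keeps é composed (6 code points)
# NFKD replaces the ligature and splits é                (7 code points)
```

Note that none of the four removes the accent; the decomposed forms only separate it from its base letter.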

Also, Unicode normalization forms are definitely _not_ intended to 'strip accents'. That is not what they are for, and they aren't the solution to that, even if the compatibility normalizations do it in some cases.