Hacker News
by 2shortplanks 711 days ago
Fowler's Law on Unicode: There's always another bug, you just haven't found it yet.

Dr Drang's script counts the number of _characters_ (code points), not the number of _glyphs_. This matters because there's more than one way to represent é: either as the single Unicode character \x{e9} (the precomposed form you get from "NFC") or as an "e" followed by the combining character that adds the accent (the decomposed form you get from "NFD").

For example, for "léon" with a decomposed é this prints out "l3n" for me, where I'd expect "l2n".

What you need to do is normalize to NFC.

> /usr/bin/perl -C -MUnicode::Normalize -pe '$_=NFC($_);s/(.)(.+)(.)/$1 . length($2) . $3/e'

1 comment

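NFC isn't right, either: some combinations of a letter and combining marks don't have precomposed forms, so they stay as multiple characters even after normalization. IMO, you need to pull in a whole glyph-counting algorithm, i.e. count grapheme clusters rather than characters.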