by knight666
3961 days ago
You can find all the case folding and special casing data on the Unicode Consortium website: ftp://ftp.unicode.org/Public/UNIDATA/CaseFolding.txt and ftp://ftp.unicode.org/Public/UNIDATA/SpecialCasing.txt. As you get down the list you'll notice what a pain in the ass the special cases are. There's a special case for the final sigma in a Greek word:

    03A3; 03C2; 03A3; 03A3; Final_Sigma; # GREEK CAPITAL LETTER SIGMA

You must remove the combining dot above that follows an "i" when upper- or titlecasing... but only in Lithuanian:

    0307; 0307; ; ; lt After_Soft_Dotted; # COMBINING DOT ABOVE
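To make the layout of those two records concrete, here is a minimal Python sketch that splits a SpecialCasing.txt line into its fields (code point; lowercase; titlecase; uppercase; optional condition list; comment). The names `SpecialCasing` and `parse_special_casing` are made up for illustration and are not part of any library.

```python
from typing import NamedTuple, Tuple


class SpecialCasing(NamedTuple):
    code: int                     # source code point
    lower: str                    # lowercase mapping (may be empty)
    title: str                    # titlecase mapping (may be empty)
    upper: str                    # uppercase mapping (may be empty)
    conditions: Tuple[str, ...]   # e.g. ("lt", "After_Soft_Dotted")
    comment: str


def _to_str(field: str) -> str:
    # A mapping field is zero or more hex code points separated by spaces.
    return "".join(chr(int(cp, 16)) for cp in field.split())


def parse_special_casing(line: str) -> SpecialCasing:
    # Everything after '#' is a comment; the rest is ';'-separated fields.
    data, _, comment = line.partition("#")
    fields = [f.strip() for f in data.split(";")]
    if fields and fields[-1] == "":
        fields.pop()  # drop the empty piece after the final ';'
    code, lower, title, upper = fields[:4]
    conditions = tuple(fields[4].split()) if len(fields) > 4 else ()
    return SpecialCasing(int(code, 16), _to_str(lower), _to_str(title),
                         _to_str(upper), conditions, comment.strip())


sigma = parse_special_casing(
    "03A3; 03C2; 03A3; 03A3; Final_Sigma; # GREEK CAPITAL LETTER SIGMA")
dot = parse_special_casing(
    "0307; 0307; ; ; lt After_Soft_Dotted; # COMBINING DOT ABOVE")
print(sigma.conditions, dot.conditions)
```

The empty titlecase and uppercase fields in the second record are what "remove the dot" means in practice: the mapping is the empty string, and it only applies when the language is `lt` and the context condition holds.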
Etc. etc. By the way, my implementation for case mapping started out similarly to yours, but I ultimately solved the problem using a binary search in a huge look-up table: https://bitbucket.org/knight666/utf8rewind/src/c22e458912952...

Unicode case mapping is just a huge mess of exceptions, but that's more the fault of human languages than of the standard.
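For anyone wondering what a binary search over a look-up table looks like for this problem, here is a rough Python sketch. It is not how utf8rewind lays out its data; the tiny `UPPER_TABLE` and the `to_upper` helper are hypothetical, and a real table would be generated from the Unicode data files and would still need the context-sensitive rules handled separately.

```python
from bisect import bisect_left

# Hypothetical miniature table, sorted by code point.
# Each entry is (code point, uppercase mapping); a mapping may be
# longer than one character, as with U+00DF.
UPPER_TABLE = [
    (0x0061, "A"),        # a
    (0x00DF, "SS"),       # sharp s expands to two letters
    (0x00E9, "\u00C9"),   # e with acute
    (0x03C2, "\u03A3"),   # final sigma
    (0x03C3, "\u03A3"),   # sigma
    (0x0450, "\u0400"),   # Cyrillic ie with grave
]
_KEYS = [cp for cp, _ in UPPER_TABLE]


def to_upper(ch: str) -> str:
    """Return the uppercase mapping for one character, or the
    character itself when the table has no entry for it."""
    cp = ord(ch)
    i = bisect_left(_KEYS, cp)  # O(log n) search over the sorted keys
    if i < len(_KEYS) and _KEYS[i] == cp:
        return UPPER_TABLE[i][1]
    return ch


print(to_upper("\u00DF"), to_upper("\u03C2"), to_upper("Z"))
```

A sorted table keeps memory compact compared to a dense array indexed by code point, at the cost of a logarithmic search per character.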