by faragon 3961 days ago
Is the Turkish case conversion working?

edit: it seems it is: https://bitbucket.org/knight666/utf8rewind/pull-requests/1/t...

1 comments

No, currently Turkish, Greek, Lithuanian and Azeri case mappings are not always grammatically correct. This is because utf8rewind currently does not take the system locale into account when case mapping. Fixing these issues is planned for a future release.
You could also consider offering a system-independent locale setting, e.g. "set_turkisk_mode" (I had that problem, too). I thought the only case conversion exception was the Turkish one. Do you remember which cases are exceptions for Greek, Lithuanian and Azeri? Also, I know that German has some non-bijective cases ("ß" -> SS).
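To make the idea concrete, here is a minimal sketch of case mapping driven by an explicit flag instead of the system locale. The names to_upper_cp, to_lower_cp and the turkic parameter are made up for illustration and are not part of utf8rewind or libsrt. Only ASCII letters and the dotted/dotless "i" are handled.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helpers (not from utf8rewind or libsrt): map one code
 * point, choosing Turkish/Azeri rules through an explicit flag rather
 * than the system locale. Only ASCII letters and the dotted/dotless
 * "i" are covered. */
static uint32_t to_upper_cp(uint32_t cp, bool turkic)
{
    if (turkic && cp == 0x0069)        /* i -> U+0130, capital I with dot above */
        return 0x0130;
    if (cp == 0x0131)                  /* dotless i -> I */
        return 0x0049;
    if (cp >= 0x0061 && cp <= 0x007A)  /* a-z */
        return cp - 0x20;
    return cp;
}

static uint32_t to_lower_cp(uint32_t cp, bool turkic)
{
    if (turkic && cp == 0x0049)        /* I -> U+0131, dotless i */
        return 0x0131;
    if (cp == 0x0130)                  /* simplified: the default full mapping is "i" + U+0307 */
        return 0x0069;
    if (cp >= 0x0041 && cp <= 0x005A)  /* A-Z */
        return cp + 0x20;
    return cp;
}
```

With the flag off, "i" upper-cases to "I" as usual; with it on, "i" becomes U+0130 and "I" lower-cases to U+0131, regardless of what the system locale says.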

If you want to save space in tables, you can opt for encoding ranges in the code, e.g. see sc_tolower()/sc_toupper() in: https://github.com/faragon/libsrt/blob/master/src/schar.c
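A rough sketch of what range encoding can look like, in case it helps. This is not the code from schar.c; range_tolower is a hypothetical name and it only covers a handful of blocks where upper and lower case sit at a fixed offset or alternate in pairs.

```c
#include <stdint.h>

/* Illustrative only: lower-case a code point with hard-coded ranges
 * instead of a per-code-point table. Covers a few blocks; anything
 * else is returned unchanged. */
static uint32_t range_tolower(uint32_t cp)
{
    if (cp >= 0x0041 && cp <= 0x005A)                  /* A-Z */
        return cp + 0x20;
    if (cp >= 0x00C0 && cp <= 0x00DE && cp != 0x00D7)  /* Latin-1 capitals, skipping U+00D7 (multiplication sign) */
        return cp + 0x20;
    if (cp >= 0x0100 && cp <= 0x012F)                  /* Latin Extended-A pairs: even = upper, odd = lower */
        return cp | 1;
    if ((cp >= 0x0391 && cp <= 0x03A1) ||
        (cp >= 0x03A3 && cp <= 0x03AB))                /* Greek capitals (U+03A2 is unassigned) */
        return cp + 0x20;
    if (cp >= 0x0400 && cp <= 0x040F)                  /* Cyrillic U+0400..U+040F */
        return cp + 0x50;
    if (cp >= 0x0410 && cp <= 0x042F)                  /* Cyrillic U+0410..U+042F */
        return cp + 0x20;
    return cp;
}
```

Each branch replaces dozens of table rows with one comparison, at the cost of having to hand-check every exception inside a range (like U+00D7 above).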

You can find all case folding codepoints on the Unicode Consortium website: ftp://ftp.unicode.org/Public/UNIDATA/CaseFolding.txt and ftp://ftp.unicode.org/Public/UNIDATA/SpecialCasing.txt.

As you get down the list you'll notice what a pain in the ass the special cases are. There's a special case for the final sigma in a Greek word:

    03A3; 03C2; 03A3; 03A3; Final_Sigma; # GREEK CAPITAL LETTER SIGMA
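As a rough illustration of the Final_Sigma condition, the check could be approximated as follows. This is a simplified sketch: is_cased and lower_sigma_at are hypothetical names, the "cased letter" test only knows ASCII and basic Greek, and the case-ignorable characters that the real rule skips over are not handled.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for the Unicode "cased letter" property: only
 * ASCII and basic Greek letters are recognised. */
static bool is_cased(uint32_t cp)
{
    return (cp >= 'A' && cp <= 'Z') || (cp >= 'a' && cp <= 'z') ||
           (cp >= 0x0391 && cp <= 0x03A9 && cp != 0x03A2) ||
           (cp >= 0x03B1 && cp <= 0x03C9);
}

/* Lower-case the capital sigma at text[i]: final sigma (U+03C2) when a
 * cased letter precedes it and none follows, regular sigma (U+03C3)
 * otherwise. Assumes text[i] is U+03A3. */
static uint32_t lower_sigma_at(const uint32_t *text, size_t len, size_t i)
{
    bool cased_before = i > 0 && is_cased(text[i - 1]);
    bool cased_after = i + 1 < len && is_cased(text[i + 1]);
    return (cased_before && !cased_after) ? 0x03C2 : 0x03C3;
}
```

The point is that the mapping for one code point depends on its neighbours, so a plain per-code-point lookup cannot express it.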
You must remove the combining dot above that follows "i" when upper- or titlecasing... but only in Lithuanian:

    0307; 0307; ; ; lt After_Soft_Dotted; # COMBINING DOT ABOVE
Etc. etc. By the way, my implementation for case mapping started out similarly to yours, but I ultimately solved the problem using a binary search in a huge look-up table: https://bitbucket.org/knight666/utf8rewind/src/c22e458912952...
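For anyone curious, the general shape of a binary search over a sorted mapping table looks roughly like this. It is a toy sketch, not utf8rewind's code: the case_entry layout, the five-entry lower_table and table_tolower are all invented for illustration.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical record layout for illustration, not utf8rewind's. */
struct case_entry {
    uint32_t from;
    uint32_t to;
};

/* Must stay sorted by `from` for the binary search to work. */
static const struct case_entry lower_table[] = {
    { 0x0041, 0x0061 }, /* A -> a */
    { 0x00C9, 0x00E9 }, /* E with acute -> e with acute */
    { 0x0130, 0x0069 }, /* I with dot above -> i (simple mapping) */
    { 0x03A3, 0x03C3 }, /* capital sigma -> small sigma */
    { 0x0416, 0x0436 }, /* Cyrillic ZHE -> zhe */
};

/* bsearch() comparator: key is a code point, elem is a table entry. */
static int compare_entry(const void *key, const void *elem)
{
    uint32_t cp = *(const uint32_t *)key;
    const struct case_entry *e = elem;
    return (cp > e->from) - (cp < e->from);
}

/* Return the lower-case mapping, or cp itself when it has no entry. */
static uint32_t table_tolower(uint32_t cp)
{
    const struct case_entry *hit = bsearch(&cp, lower_table,
        sizeof lower_table / sizeof lower_table[0],
        sizeof lower_table[0], compare_entry);
    return hit ? hit->to : cp;
}
```

Lookup cost grows with the logarithm of the table size, and every exception is just another row rather than another branch in the code.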

Unicode case mapping is just a huge mess of exceptions, but that's more the humans' fault than the standard's.

Thank you! :-)