Sorry, but PHP's string handling is abysmal.

- strlen can't be used for length checks if you care about length in characters, a problem compounded when you use a real database where field lengths are defined in characters instead of bytes.
- The only way to iterate char by char instead of byte by byte is to use mb_substr, as there is no trick to make $str[$i] do anything but return bytes.
- The string and array APIs have inverse ordering of their parameters, which means that after a decade of writing PHP I still can't remember which is which without looking it up.
- Typing mb_ in front of everything is ugly enough, but it also makes autocompletion tricky, especially since the strings don't have methods (e.g. $str->pos()).
- Speaking of which, since the strings don't have methods, they stick out like a sore thumb in OO code. String and array handling code invariably ends up ugly unless you write everything procedural style (and then you have other issues).
- sort() cannot be made to sort Unicode on Windows, regardless of which parameters you give it. In fact, the only way to sort Unicode on Windows is by using the Collator from the intl extension. Part of that is Microsoft's fault for not supporting UTF-8 in the Windows APIs at all, but PHP isn't helping.
- If you don't care about Windows, the proper way to sort text is first calling setlocale(LC_COLLATE, "en_US.UTF8") and then passing SORT_LOCALE_STRING as the second parameter to every call to sort(). Ugly, ugly, ugly.
- natsort(), aka "natural sort", cannot be used to sort text like a human would expect, in any context. It always produces invalid results, even for ANSI codepages (e.g. try to sort resume, rope and résumé).
- The use of utf8_decode and utf8_encode is actively harmful in almost all circumstances. There is never a good reason to use them, since in the very rare cases where you need that kind of conversion, iconv or mb_convert_encoding is better suited. Yet the PHP documentation doesn't tell you this, causing lots of people to be led astray (as I once was).
- Oh yeah, and there are no fewer than three APIs for Unicode string handling: the mb_ functions, the iconv_ functions and the grapheme_ functions. What's the difference? I don't know, and I really can't be bothered to read PHP's source to find out.
- htmlentities() always requires the parameters ENT_QUOTES, "UTF-8" to do its job securely (well, almost, as it doesn't encode the forward slash, which OWASP recommends). Unless you use a wrapper, your code is yet again uglified.
- The secure way to JSON-encode text is, and I kid you not, json_encode($data, JSON_HEX_TAG|JSON_HEX_APOS|JSON_HEX_QUOT|JSON_HEX_AMP). Try typing that three times in a row, I dare you.
- And finally, MySQL is by far the worst database for Unicode handling, because it cannot sort Unicode text according to the standard, at all, no matter what you do. That's not PHP's fault of course, but since I'm bitching... :)
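For what it's worth, here is a rough sketch of the kind of wrappers the htmlentities() and json_encode() complaints above push you toward, plus the strlen vs. mb_strlen and $str[$i] vs. mb_substr difference. The helper names (h, json_for_html, utf8_chars) are made up for illustration and are not part of PHP; the sketch assumes UTF-8 strings and that the mbstring extension is loaded.

```php
<?php
// Sketch only: h(), json_for_html() and utf8_chars() are hypothetical helper
// names, not PHP built-ins. Assumes UTF-8 input and the mbstring extension.

// Escape text for HTML output, always passing the flags from the post.
function h(string $text): string
{
    return htmlentities($text, ENT_QUOTES, 'UTF-8');
}

// JSON-encode with the hex-escaping flags from the post, so the result can
// be embedded in an HTML page.
function json_for_html($data): string
{
    $json = json_encode($data, JSON_HEX_TAG | JSON_HEX_APOS | JSON_HEX_QUOT | JSON_HEX_AMP);
    if ($json === false) {
        throw new RuntimeException('json_encode failed: ' . json_last_error_msg());
    }
    return $json;
}

// Split a UTF-8 string into characters (code points) via mb_substr,
// since $str[$i] only ever returns single bytes.
function utf8_chars(string $str): array
{
    $chars = [];
    $length = mb_strlen($str, 'UTF-8');
    for ($i = 0; $i < $length; $i++) {
        $chars[] = mb_substr($str, $i, 1, 'UTF-8');
    }
    return $chars;
}

// "résumé" spelled with explicit UTF-8 bytes, so the source file's own
// encoding doesn't matter.
$word = "r\xC3\xA9sum\xC3\xA9";

$byteLength = strlen($word);             // 8: counts bytes
$charLength = mb_strlen($word, 'UTF-8'); // 6: counts characters
$secondByte = $word[1];                  // "\xC3": half of the "é"
$secondChar = utf8_chars($word)[1];      // "\xC3\xA9": the whole "é"
```

With these in place, templates can call h($value) and json_for_html($data) instead of repeating the flags everywhere. It doesn't make the underlying API any prettier, it just hides it in one spot.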