by nikic 4542 days ago
It's a common misconception. Many people don't understand that the normal string functions are perfectly safe on UTF-8, as long as you don't use hardcoded lengths or offsets. For example, substr($str, 0, 50) is not safe because the hardcoded "50" is a byte count that can fall in the middle of a multibyte character, but substr($str, 0, strpos($str, "foo")) will work correctly on any well-formed UTF-8, since strpos only ever returns an offset where a complete character starts.
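
A minimal sketch of that difference, assuming the mbstring extension is loaded (used here only for the validity check) and the source file is saved as UTF-8. The sample string is made up for illustration:

```php
<?php
// Hypothetical sample text: "é" and "ö" each take 2 bytes in UTF-8.
$str = "héllo wörld foo bar";

// Hardcoded byte length: 2 bytes is "h" plus only the first byte of "é".
$cut = substr($str, 0, 2);
$cutIsValid = mb_check_encoding($cut, 'UTF-8');       // false

// Offset derived from the data: strpos() returns the byte position
// where "foo" begins, which is always a character boundary.
$pos = strpos($str, "foo");                           // 14
$prefix = substr($str, 0, $pos);                      // "héllo wörld "
$prefixIsValid = mb_check_encoding($prefix, 'UTF-8'); // true

var_dump($cutIsValid, $pos, $prefixIsValid);
```

The second call stays valid because both the haystack and the needle are well-formed UTF-8, so a byte-level match can only start at a character boundary.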

If people have encoding issues in PHP it usually just means that they didn't manage to set up their database properly (you know, finding which one of the 10 encoding options in MySQL is the right one ;)

From my personal experience I've had a lot more issues with encoding in Python than I had in PHP - exactly because PHP ignores encoding and lets me deal with it.

3 comments

PHP should keep everything as-is and simply add a new data type for Unicode strings. It should be utterly and completely incompatible with any current function that accepts strings:

    $binString = "hello";
    $unicodeString = u"Hello";
    strlen($binString); // 5
    strlen($unicodeString); // Error

And then developers can slowly start making functions and methods more Unicode-aware as necessary. Or make an entirely new string API. Then just have functions to take binary strings and convert them to Unicode strings (providing an encoding) and Unicode strings to binary strings (also providing an encoding).

This would be way safer and simpler than PHP6 or Python 3.
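
The u"..." literal above is not real PHP, but the idea can be approximated in userland. A rough sketch, assuming PHP 8.0 or later and the mbstring extension; the UnicodeString class and its decode/encode/length methods are hypothetical names invented for this example, not an existing API:

```php
<?php
// Hypothetical wrapper type: deliberately has no __toString(), so the
// existing byte-string functions cannot accept it by accident.
final class UnicodeString
{
    private function __construct(private string $utf8) {}

    // Binary string -> Unicode string; the caller must name the encoding.
    public static function decode(string $bytes, string $encoding): self
    {
        if (!mb_check_encoding($bytes, $encoding)) {
            throw new InvalidArgumentException("Input is not valid $encoding");
        }
        return new self(mb_convert_encoding($bytes, 'UTF-8', $encoding));
    }

    // Unicode string -> binary string; again with an explicit encoding.
    public function encode(string $encoding): string
    {
        return mb_convert_encoding($this->utf8, $encoding, 'UTF-8');
    }

    // Length in characters (code points), not bytes.
    public function length(): int
    {
        return mb_strlen($this->utf8, 'UTF-8');
    }
}

// "h\xE9llo" is "héllo" encoded as ISO-8859-1 (one byte per character).
$u = UnicodeString::decode("h\xE9llo", 'ISO-8859-1');
$chars = $u->length();                 // 5
$bytes = strlen($u->encode('UTF-8'));  // 6, because "é" is 2 bytes in UTF-8

// In PHP 8, passing the object to a byte-string function is a TypeError.
$rejected = false;
try {
    strlen($u);
} catch (TypeError $e) {
    $rejected = true;
}
var_dump($chars, $bytes, $rejected);
```

A class like this only gets part of the way there: it cannot add a literal syntax, and it does nothing about normalization or grapheme clusters.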

I agree. Python creates a lot of problems that are non-obvious. For example, os.walk('.') works just fine, right up to the point where it silently trashes all Unicode file names.

As I understand it, to work with Unicode in PHP you need to use the Multibyte String functions, since UTF-8 encodes non-ASCII characters as 2 or more bytes rather than the traditional 1 byte. So a "substr" as in your example would not work unless you use "mb_substr".
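
For a fixed character count the byte-based and multibyte functions really do differ. A small sketch, assuming the mbstring extension is loaded and the source file is saved as UTF-8 (the sample string is made up):

```php
<?php
// Hypothetical sample: each Cyrillic letter is 2 bytes in UTF-8.
$str = "Привет, мир";

$byteLen = strlen($str);              // 20 (bytes)
$charLen = mb_strlen($str, 'UTF-8');  // 11 (characters)

// substr() counts bytes: 3 bytes is "П" plus half of "р".
$byteCut = substr($str, 0, 3);
$byteCutValid = mb_check_encoding($byteCut, 'UTF-8');  // false

// mb_substr() counts characters, so the result is always well-formed.
$charCut = mb_substr($str, 0, 3, 'UTF-8');             // "При"

var_dump($byteLen, $charLen, $byteCutValid, $charCut);
```

So mb_substr is the right tool when the limit is expressed in characters, while plain substr remains fine when the offsets come from byte-level searches on the same string.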