Hacker News
by jcampbell1 4542 days ago
Other than "substring", how often do you really run into this problem? I have never had a problem with PHP and UTF-8.

Oh, good luck trying to match all Unicode punctuation with Python/JavaScript. With PHP it is as simple as preg_match('/\pP/u', $str).
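For anyone who hasn't used it, here is a minimal sketch of the \pP match described above. The sample string and variable names are made up for illustration, and it assumes the file is saved as UTF-8.

```php
<?php
// Assumes this file is saved as UTF-8 and PCRE has Unicode property
// support (the case in stock PHP builds).
$text = "¿Qué? «Sí» … ok!";

// \pP matches any code point in the Unicode "Punctuation" category;
// the /u modifier makes PCRE read pattern and subject as UTF-8.
$count = preg_match_all('/\pP/u', $text, $matches);

echo $count, "\n";                    // 6
echo implode(' ', $matches[0]), "\n"; // ¿ ? « » … !
```

Without the /u modifier the subject is scanned byte by byte, so the multibyte punctuation would not be matched as single characters.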

I deal with a lot of UTF-8 issues, and while php is super ugly, there are pretty good solutions in cases where it requires ugly hacks in other languages.

5 comments

It's a common misconception. Many people don't understand that the normal string functions are perfectly safe on UTF-8, as long as you don't use hardcoded lengths or offsets. I.e. substr($str, 0, 50) is not safe due to the explicit "50" in there, but substr($str, 0, strpos($str, "foo")) will work correctly on any well-formed UTF-8.
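A small sketch of that point, assuming the mbstring extension is loaded (it is only used here to check the results). The sample string is made up. The computed offset is safe because UTF-8 is self-synchronizing: a well-formed needle can never match in the middle of a character.

```php
<?php
$str = "Zażółć gęślą jaźń foo bar";

// strpos() returns a byte offset and substr() consumes byte offsets,
// so the cut always lands on a character boundary.
$head = substr($str, 0, strpos($str, "foo"));

// A hardcoded byte count can split a multibyte character ("ż" is 2 bytes).
$cut = substr($str, 0, 3);

var_dump(mb_check_encoding($head, 'UTF-8')); // bool(true)
var_dump(mb_check_encoding($cut, 'UTF-8'));  // bool(false)
```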

If people have encoding issues in PHP it usually just means that they didn't manage to set up their database properly (you know, finding which one of the 10 encoding options in MySQL is the right one ;)

From my personal experience I've had a lot more issues with encoding in Python than I had in PHP - exactly because PHP ignores encoding and lets me deal with it.

PHP should keep everything as-is and simply add a new data type for unicode strings. It should be utterly and completely incompatible with any current function that accepts strings:

    $binString = "hello";
    $unicodeString = u"Hello";
    strlen($binString); // 5
    strlen($unicodeString); // Error
And then developers can slowly start making functions and methods more unicode aware as necessary. Or make an entirely new string API. Then just have functions to take binary strings and convert them to unicode strings (providing an encoding) and unicode strings to binary strings (also providing an encoding).

This would be way safer and simpler than PHP6 or Python 3.
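The u"..." literal doesn't exist, but the idea can be approximated in userland today. This is only a rough sketch: UnicodeString and its methods are hypothetical names, not a real PHP type, and it assumes the mbstring extension.

```php
<?php
// Hypothetical sketch of the proposed separation; UnicodeString is NOT
// a built-in PHP type. Binary strings go in and out only with an
// explicit encoding.
final class UnicodeString
{
    private $utf8; // internal storage, always valid UTF-8

    private function __construct($utf8)
    {
        $this->utf8 = $utf8;
    }

    public static function fromBytes($bytes, $encoding)
    {
        if (!mb_check_encoding($bytes, $encoding)) {
            throw new InvalidArgumentException("Input is not valid $encoding");
        }
        return new self(mb_convert_encoding($bytes, 'UTF-8', $encoding));
    }

    public function toBytes($encoding)
    {
        return mb_convert_encoding($this->utf8, $encoding, 'UTF-8');
    }

    public function length()
    {
        return mb_strlen($this->utf8, 'UTF-8'); // code points, not bytes
    }
}

$u = UnicodeString::fromBytes("h\xE9llo", 'ISO-8859-1'); // Latin-1 "héllo"
echo $u->length(), "\n";                // 5
echo strlen($u->toBytes('UTF-8')), "\n"; // 6
```

Because the class has no __toString(), passing $u to strlen() is rejected on PHP 8 (a TypeError), which is roughly the incompatibility the proposal asks for.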

I agree. Python creates a lot of problems that are non-obvious. Like os.walk('.') works just fine, right up to the point where it silently trashes all unicode file names.

As I understand it, to work with Unicode in PHP you need to use the Multibyte String functions, since UTF-8 encodes non-ASCII characters into 2+ bytes vs the traditional 1 byte. So doing a "substr" as per your example would not work unless you use "mb_substr".

Sorry, but PHP's string handling is abysmal.

- strlen can't be used for length checks if you care about length in characters, a problem compounded when you use a real database where field lengths are defined in characters instead of bytes

- the only way to iterate char by char instead of byte by byte is to use mb_substr, as there is no trick to make $str[$i] do anything but return bytes.

- the string and array APIs have inverse ordering of their parameters, which means that after a decade of writing PHP I still can't remember which is which without looking it up

- Typing mb_ in front of everything is ugly enough, but it also makes autocompletion tricky, especially since the strings don't have methods. (e.g. $str->pos())

- Speaking of which, since the strings don't have methods, they stick out like a sore thumb in OO code. String and array handling code invariably ends up ugly unless you write everything procedural style (and then you have other issues).

- sort() cannot be made to sort unicode on Windows, regardless of which parameters you give it. In fact, the only way to sort unicode on Windows is by using the Collator from the intl extension. Part of that is Microsoft's fault for not supporting UTF-8 in the Windows APIs at all, but PHP isn't helping.

- If you don't care about windows, the proper way to sort text is first calling setlocale(LC_COLLATE, "en_US.UTF8") and then passing the SORT_LOCALE_STRING argument as second parameter to every call to sort(). Ugly, ugly, ugly.

- natsort(), aka "natural sort" cannot be used to sort text like a human would expect, in any context. It always produces invalid results, even for ANSI codepages. (e.g. try to sort resume, rope and résumé)

- The use of utf8_decode and utf8_encode is actively harmful in almost all circumstances. There is never a good reason to use them, since in the very rare cases where you need them, iconv or mb_convert_encoding are better suited. Yet, the PHP documentation doesn't tell you this, causing lots of people to be led astray (as I once was).

- Oh yeah, and there are no less than three APIs for unicode string handling, the mb_ functions, the iconv_ functions and the grapheme_ functions. What's the difference? I don't know, and I really can't be bothered to read PHP's source to find out.

- htmlentities() always requires the parameters ENT_QUOTES, "UTF-8" to do its job securely (well, almost, as it doesn't encode forward slash which OWASP recommends). Unless you use a wrapper, your code is yet again uglified.

- The secure way to JSON-encode text is, and I kid you not, json_encode($data, JSON_HEX_TAG|JSON_HEX_APOS|JSON_HEX_QUOT|JSON_HEX_AMP). Try typing that three times in a row, I dare you.

- And finally, MySQL is by far the worst database for unicode handling, because it cannot sort unicode text according to the standard, at all, no matter what you do. That's not PHP's fault of course, but since I'm bitching... :)
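For what it's worth, a few of the items above (the htmlentities arguments, the json_encode flags, char-by-char iteration) can be hidden behind thin wrappers. A minimal sketch, where the helper names e, json_for_html and utf8_chars are made up and only the built-in calls inside them are real:

```php
<?php
// Hypothetical helper names; only the built-in functions they call
// are real PHP APIs.
function e($text)
{
    return htmlentities($text, ENT_QUOTES, 'UTF-8');
}

function json_for_html($data)
{
    return json_encode($data, JSON_HEX_TAG | JSON_HEX_APOS | JSON_HEX_QUOT | JSON_HEX_AMP);
}

// Split into code points with PCRE instead of looping over mb_substr().
function utf8_chars($str)
{
    return preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
}

echo e("<'>"), "\n";                    // &lt;&#039;&gt;
echo json_for_html("</script>"), "\n";  // "\u003C\/script\u003E"
echo count(utf8_chars("résumé")), "\n"; // 6
```

This doesn't address the sorting or parameter-order complaints, it only cuts down on the typing.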

The json_encode one is particularly annoying as you can't even consider sending JSON to the browser and using it in JS without all the flags on, especially if you're dealing with user data or scraping websites.

Just out of curiosity, could you expand on that a little more? I'm currently in the process of ripping my hair out with json_encode issues.

We send back a json_encoded array containing a property "html" with html markup to be injected back into a contenteditable div. Originally I had a ton of problems if there happened to be the invisible unicode character in there (php's str functions could not find/replace this character no matter what I tried). However, this is the first I've seen those flags the GP mentioned.

If you look up the json_encode function on php.net it lists the flags. As I understand it, there's a difference between the encoding allowed in JavaScript and the encoding allowed in JSON, which means certain characters that would be valid in a JavaScript object don't necessarily work in JSON, so you end up having to escape everything possible. There's one flag (JSON_UNESCAPED_UNICODE) which doesn't work on the version of PHP I develop on, so I ended up applying code listed in the comments. This also adds some cruft similar to what Google/Facebook etc. use if you want it. I haven't had any issues with it yet but, being what it is, there may be a more elegant way to accomplish it.

    function AsJSON($arr, $cruft = null) {
        // The convmap starts at 0x80, so every multibyte character (anything
        // above ASCII 127, up to U+FFFF) becomes a numeric entity and is
        // "hidden" from json_encode's own \uXXXX escaping.
        $convmap = array(0x80, 0xffff, 0, 0xffff);

        array_walk_recursive($arr, function (&$item, $key) use ($convmap) {
            if (is_string($item)) {
                $item = mb_encode_numericentity($item, $convmap, 'UTF-8');
            }
        });

        $json = json_encode($arr, JSON_HEX_TAG | JSON_HEX_APOS | JSON_HEX_QUOT | JSON_HEX_AMP);
        if (json_last_error() !== JSON_ERROR_NONE) {
            return null; // encoding failed
        }

        // Caveat: JSON_HEX_AMP turns the "&" of each entity into \u0026, so
        // with that flag set this decode finds nothing to convert and the
        // entities reach the browser as-is.
        return $cruft . mb_decode_numericentity($json, $convmap, 'UTF-8');
    }
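On PHP 5.4 or later, the built-in JSON_UNESCAPED_UNICODE flag should make the entity round trip unnecessary. A short sketch under that assumption, with a made-up payload:

```php
<?php
// Assumes PHP 5.4 or later, where JSON_UNESCAPED_UNICODE is available.
$payload = array('html' => "<b>é</b>");

$json = json_encode(
    $payload,
    JSON_UNESCAPED_UNICODE | JSON_HEX_TAG | JSON_HEX_APOS | JSON_HEX_QUOT | JSON_HEX_AMP
);

echo $json, "\n"; // {"html":"\u003Cb\u003Eé\u003C\/b\u003E"}
```

The multibyte character stays as raw UTF-8 while the HTML-significant characters are still hex-escaped.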
You might also consider looking at the ctype_* functions if you haven't.

Seeing the type of content, the character in question might very well be a BOM.

You can identify them with pack("CCC", 0xef, 0xbb, 0xbf), which produces the three bytes "\xEF\xBB\xBF" (invisible when printed).
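A quick sketch of finding and removing that byte sequence with the plain byte-oriented string functions. The sample markup is made up, and it assumes the invisible character really is the UTF-8 encoded BOM (U+FEFF):

```php
<?php
$bom = pack("CCC", 0xef, 0xbb, 0xbf); // same bytes as "\xEF\xBB\xBF"

// Made-up markup with the invisible character inside it.
$html = "<p>" . $bom . "hi</p>";

var_dump(strpos($html, $bom)); // int(3), a byte offset

// str_replace() compares raw bytes, so it removes every occurrence.
$clean = str_replace($bom, '', $html);
echo $clean, "\n"; // <p>hi</p>
```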

define('OH_MY_THOSE_FLAG', FLAG | FLAG | FLAG | FLAG);

But then again, my name gives me special powers.

Nevertheless...

Yeah, but as with many complaints about PHP, how would you define the default behavior? I'm seeing so many people who expect PHP to always do stuff how they want, because of how easy it is to work with. I never read those complaints about super tricky C cases.

Modifying the content by default would be completely counter-intuitive, especially with a function that is used mostly in APIs, where the data is encoded and decoded in different places and languages.

IMO content filtering is specific to the view where you output it, and should be done there. Any hazardous content should always go through a sanitize function when you echo it in the middle of HTML.

I agree with most of your bashing. PHP is a fucking mess, and non OO strings is a hangover vomit from C.

That being said, when there is a problem, there actually is a solution. PCRE actually works. Javascript, for instance, has no collation support at all as far as I can tell.

Personally, most of the problems with UTF-8 are mixed content issues.

The best fix is not the Python route, but rather just deprecating a bunch of stuff such as utf8_encode/decode. Throw a warning when any database connection is not utf8. Throw a warning when the OS is not set up to return UTF-8. It is more important that people run PHP in an end-to-end UTF-8 environment than changing the internals. Once people have a good environment, they will stop talking about strlen/strpos, which are really not much of a problem. Maybe they should be renamed bytelen/bytepos, but PHP has too many of that type of problem to count.

99% of the UTF-8 problems don't exist if everything is UTF-8. Counting unicode code points vs bytes is not the real problem. The real problem is bullshit like 'SET NAMES utf8' / setlocale(LC_ALL, 'en_US.utf-8')

BTW, what language do you think gets this stuff right? Go looks promising, but it is brand new. I have problems with pretty much every language I know well: javascript/python/Objective-C/PHP

You're the reason PHP has a bad reputation. You have no idea what you're doing, and you use your mouth not only to breathe, but to spill hate on it too!

80% of your points can be answered by wrappers/helpers. How often do PHP developers have to walk through a user-generated string of multibyte characters? Ok, didn't think so. And when you do, you have all the required tools. 99% of string handling is simple copy, and the byte approach is great at that.

A few points that stand out, and give an amusing idea of how you implement stuff:

    "Unless you use a wrapper"
    "What's the difference? I don't know, and I really can't be bothered to read"
    "Try typing that three times in a row, I dare you"
    "Oh yeah, and there are no less than three API"
    "a real database where field lengths are defined in characters"
    "makes autocompletion tricky" on native function...

PS:

- natsort() is intended for numbers

- ENT_QUOTES isn't required unless you use single quotes to encapsulate your outputs

- ugly code = using different function names.. ok

- PHP isn't helping Microsoft.. ok

- utf8_decode is a shortcut for iconv() using the 2 most common encodings, just FYI

Edit: formatting

I do run into this fairly regularly. The mb_ functions are solid, but I was hitting something the other day where a character, I think it was NBSP, was causing a string to output empty in 5.4 (but was working fine in 5.3).

I think there's something that needs to be fixed at a core level, maybe with PHP 6, that just guts how the language deals with multibyte. Even if it means making it backwards incompatible. I'd probably take the opportunity to drop the mb_ functions, namespace the entire language, and make it multibyte by default. Needs doing eventually!

Pretty often. Yes you can use mb_* functions everywhere but not all string functions have an mb_* counterpart, and forget just once and you risk blowing up the entire rest of the request. Not to mention times when you have to use a 3rd party library that doesn't bother with mb_*.

Native UTF is such an important thing... I know it's tough to implement, but come on guys!

strlen and strpos are two fairly commonly used functions that come to mind.

Use mb_strlen and mb_strpos instead where the string may contain multibyte characters.

Or patchwork/utf8 [1] since it handles fallback in the event mb_* isn't installed and can utilize several other libs.
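To make the difference between the byte-based and mb_ versions concrete, a minimal sketch assuming the mbstring extension and a UTF-8 source file (the sample string is made up):

```php
<?php
$s = "naïve café";

echo strlen($s), "\n";                        // 12 (bytes)
echo mb_strlen($s, 'UTF-8'), "\n";            // 10 (characters)

echo strpos($s, "café"), "\n";                // 7 (byte offset)
echo mb_strpos($s, "café", 0, 'UTF-8'), "\n"; // 6 (character offset)
```

Either pair is consistent on its own. Trouble starts when a byte offset from strpos is fed to mb_substr, or the other way round.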

[1] https://github.com/nicolas-grekas/Patchwork-UTF8