by Joeri 4542 days ago
Sorry, but PHP's string handling is abysmal.

- strlen can't be used for length checks if you care about length in characters, a problem compounded when you use a real database where field lengths are defined in characters instead of bytes

- the only way to iterate char by char instead of byte by byte is to use mb_substr, as there is no trick to make $str[$i] do anything but return bytes.

- the string and array APIs have inverse ordering of their parameters, which means that after a decade of writing PHP I still can't remember which is which without looking it up

- Typing mb_ in front of everything is ugly enough, but it also makes autocompletion tricky, especially since the strings don't have methods. (e.g. $str->pos())

- Speaking of which, since the strings don't have methods, they stick out like a sore thumb in OO code. String and array handling code invariably ends up ugly unless you write everything procedural style (and then you have other issues).

- sort() cannot be made to sort Unicode on Windows, regardless of which parameters you give it. In fact, the only way to sort Unicode on Windows is by using the Collator from the intl extension. Part of that is Microsoft's fault for not supporting UTF-8 in the Windows APIs at all, but PHP isn't helping.

- If you don't care about Windows, the proper way to sort text is to first call setlocale(LC_COLLATE, "en_US.UTF8") and then pass SORT_LOCALE_STRING as the second parameter to every call to sort(). Ugly, ugly, ugly.

- natsort(), aka "natural sort" cannot be used to sort text like a human would expect, in any context. It always produces invalid results, even for ANSI codepages. (e.g. try to sort resume, rope and résumé)

- The use of utf8_decode and utf8_encode is actively harmful in almost all circumstances. There is never a good reason to use them, since in the very rare cases where you need them, iconv or mb_convert_encoding is better suited. Yet the PHP documentation doesn't tell you this, causing lots of people to be led astray (as I once was).

- Oh yeah, and there are no less than three APIs for Unicode string handling: the mb_ functions, the iconv_ functions and the grapheme_ functions. What's the difference? I don't know, and I really can't be bothered to read PHP's source to find out.

- htmlentities() always requires the parameters ENT_QUOTES, "UTF-8" to do its job securely (well, almost, as it doesn't encode forward slash which OWASP recommends). Unless you use a wrapper, your code is yet again uglified.

- The secure way to JSON-encode text is, and I kid you not, json_encode($data, JSON_HEX_TAG|JSON_HEX_APOS|JSON_HEX_QUOT|JSON_HEX_AMP). Try typing that three times in a row, I dare you.

- And finally, MySQL is by far the worst database for Unicode handling, because it cannot sort Unicode text according to the standard, at all, no matter what you do. That's not PHP's fault of course, but since I'm bitching... :)
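
A minimal sketch of the first two points (bytes versus characters), assuming the mbstring extension is loaded and PHP 7 or later for the "\u{...}" escape. The variable names are illustrative only:

```php
<?php
// "résumé", written with escapes so the example does not depend on the
// encoding of this source file.
$word = "r\u{e9}sum\u{e9}";

// strlen() counts bytes; each "é" takes two bytes in UTF-8.
$byteLength = strlen($word);

// mb_strlen() counts code points when told the encoding.
$charLength = mb_strlen($word, 'UTF-8');

// $word[1] is a single byte (the first half of "é"), not a character.
$secondByte = $word[1];

// Character-by-character iteration with mb_substr(), as described above.
$chars = [];
for ($i = 0; $i < $charLength; $i++) {
    $chars[] = mb_substr($word, $i, 1, 'UTF-8');
}

// An alternative that avoids the loop: split on the empty pattern in PCRE's
// UTF-8 mode.
$viaPcre = preg_split('//u', $word, -1, PREG_SPLIT_NO_EMPTY);

printf("%d bytes, %d characters\n", $byteLength, $charLength);
```

The length that matters for a database column defined in characters is the mb_strlen() one.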

3 comments

The json_encode one is particularly annoying as you can't even consider sending JSON to the browser and using it in JS without all the flags on, especially if you're dealing with user data or scraping websites.
Just out of curiosity, could you expand on that a little more? I'm currently in the process of ripping my hair out with json_encode issues.

We send back a json_encoded array containing a property "html" with HTML markup to be injected back into a contenteditable div. Originally I had a ton of problems if there happened to be an invisible Unicode character in there (PHP's str functions could not find/replace this character no matter what I tried). However, this is the first I've seen of those flags the GP mentioned.

If you look up the json_encode function on php.net it lists the flags. As I understand it, there's a difference between the encoding allowed in JavaScript and the encoding allowed in JSON, which means certain characters that would be valid in a JavaScript object don't necessarily work in JSON, so you end up having to escape everything possible. There's one flag (JSON_UNESCAPED_UNICODE) which doesn't work on the version of PHP I develop on, so I ended up applying code listed in the comments. It can also prepend some cruft, similar to what Google, Facebook, etc. use, if you want it. I haven't had any issues with it yet but, being what it is, there may be a more elegant way to accomplish it.

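    function AsJSON($arr, $cruft = null) {
        // The convmap covers code points 0x80-0xffff (everything above ASCII 127
        // in the BMP), so those characters are "hidden" from json_encode's
        // \uXXXX escaping as numeric entities.
        $convmap = array(0x80, 0xffff, 0, 0xffff);

        array_walk_recursive($arr, function (&$item, $key) use ($convmap) {
            if (is_string($item)) {
                $item = mb_encode_numericentity($item, $convmap, 'UTF-8');
            }
        });

        $json = json_encode($arr, JSON_HEX_TAG | JSON_HEX_APOS | JSON_HEX_QUOT | JSON_HEX_AMP);

        // Check for failure before touching the output, and return explicitly.
        if ($json === false || json_last_error() !== JSON_ERROR_NONE) {
            return null;
        }

        // Caveat: JSON_HEX_AMP rewrites every "&" as \u0026, including the "&" of
        // the entities created above, so this decode may find nothing to restore.
        return $cruft . mb_decode_numericentity($json, $convmap, 'UTF-8');
    }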
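
On PHP versions where JSON_UNESCAPED_UNICODE is available (5.4 and later), the numeric-entity round trip should not be needed. A hedged sketch of the same idea follows; the function name asJsonForScript and the $prefix parameter are made up for illustration, and the type declarations assume PHP 7.1 or later:

```php
<?php
// Hypothetical helper: encode $data so the result can be embedded in a page.
// The JSON_HEX_* flags turn <, >, &, ' and " inside strings into \uXXXX
// escapes; JSON_UNESCAPED_UNICODE leaves other multibyte characters as UTF-8.
function asJsonForScript($data, string $prefix = ''): ?string
{
    $flags = JSON_HEX_TAG | JSON_HEX_APOS | JSON_HEX_QUOT | JSON_HEX_AMP
           | JSON_UNESCAPED_UNICODE;

    $json = json_encode($data, $flags);

    // json_encode() returns false on failure (for example, invalid UTF-8).
    if ($json === false) {
        return null;
    }

    return $prefix . $json;
}

echo asJsonForScript(['html' => "<b>caf\u{e9}</b> & 'more'"]), "\n";
```

Returning null on failure is only one possible choice; throwing an exception would work just as well.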
You might also consider looking at the ctype_* functions if you haven't.
Seeing the type of content, the character in question might very well be a BOM.

You can identify it with pack("CCC", 0xef, 0xbb, 0xbf), i.e. the byte sequence "\xEF\xBB\xBF".
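
A small sketch of finding and removing that byte sequence, assuming the invisible character really is a UTF-8 BOM. The constant and function names are hypothetical:

```php
<?php
// The UTF-8 byte order mark is the three bytes EF BB BF.
const UTF8_BOM = "\xEF\xBB\xBF";

// Hypothetical helper: remove every BOM sequence, wherever it occurs.
// A byte-wise str_replace() is enough here because the search string is
// given as raw bytes rather than as a character.
function stripBom(string $text): string
{
    return str_replace(UTF8_BOM, '', $text);
}

// pack() builds the same byte sequence as the string literal.
var_dump(pack('CCC', 0xef, 0xbb, 0xbf) === UTF8_BOM);

// strpos() finds it as long as it is searched for as bytes.
var_dump(strpos("abc" . UTF8_BOM . "def", UTF8_BOM));
```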

define('OH_MY_THOSE_FLAG', FLAG | FLAG | FLAG | FLAG)

But then again, my name gives me special powers.

Nevertheless...
Yeah, but as with many complaints about PHP, how would you define the default behavior? I see so many people who expect PHP to always do things the way they want, because of how easy it is to work with. I never read those complaints about super tricky C cases.

Modifying the content by default would be completely counter-intuitive, especially for a function that is used mostly in APIs, where its output is encoded and decoded in different places and languages.

IMO content filtering is specific to the view where you output it, and should be done there. Any hazardous content should always go through a sanitize function when you echo it in the middle of HTML.

I agree with most of your bashing. PHP is a fucking mess, and non-OO strings are hangover vomit from C.

That being said, when there is a problem, there actually is a solution. PCRE actually works. JavaScript, for instance, has no collation support at all as far as I can tell.

Personally, most of the problems with UTF-8 are mixed content issues.

The best fix is not the Python route, but rather just deprecating a bunch of stuff such as utf8_encode/decode. Throw a warning when any database connection is not utf8. Throw a warning when the OS is not set up to return UTF-8. It is more important that people run PHP in an end-to-end UTF-8 environment than changing the internals. Once people have a good environment, they will stop talking about strlen/strpos, which are really not much of a problem. Maybe they should be renamed bytelen/bytepos, but PHP has too many problems of that type to count.

99% of the UTF-8 problems don't exist if everything is UTF-8. Counting Unicode code points vs bytes is not the real problem. The real problem is bullshit like 'SET NAMES utf8' / setlocale(LC_ALL, 'en_US.utf-8')
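
PHP itself does not emit the warnings proposed above, but a userland approximation is possible. A minimal sketch, assuming the mbstring extension; requireUtf8 is a hypothetical name for a guard applied wherever data enters the application:

```php
<?php
// Make UTF-8 the default encoding for the mb_* functions.
mb_internal_encoding('UTF-8');

// Hypothetical guard for data entering the application (request, file, DB row).
function requireUtf8(string $input): string
{
    if (!mb_check_encoding($input, 'UTF-8')) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $input;
}

// Valid UTF-8 passes through unchanged.
$ok = requireUtf8("caf\u{e9}");

// A lone Latin-1 byte (0xE9) is not valid UTF-8 and is rejected.
try {
    requireUtf8("caf\xE9");
    $rejected = false;
} catch (InvalidArgumentException $e) {
    $rejected = true;
}
```

Rejecting mixed content at the boundary is one design choice; converting it with mb_convert_encoding() is another, but that requires knowing the source encoding.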

BTW, what language do you think gets this stuff right? Go looks promising, but it is brand new. I have problems with pretty much every language I know well: JavaScript/Python/Objective-C/PHP

You're the reason PHP has a bad reputation. You have no idea what you're doing, and you use your mouth not only to breathe, but to spill hate on it too!

80% of your points can be answered by wrappers/helpers. How often do PHP developers have to walk through a user-generated string of multibyte characters? OK, didn't think so. And when you do, you have all the required tools. 99% of string handling is simple copying, and the byte approach is great at that.

A few points that stand out, and give an amusing idea of how you implement stuff:

    "Unless you use a wrapper"
    "What's the difference? I don't know, and I really can't be bothered to read"
    "Try typing that three times in a row, I dare you"
    "Oh yeah, and there are no less than three API"
    "a real database where field lengths are defined in characters"
    "makes autocompletion tricky" on native function...
PS:

- natsort() is intended for numbers

- ENT_QUOTES isn't required unless you use single quotes to encapsulate your outputs

- ugly code = using different function names.. ok

- PHP isn't helping Microsoft.. ok

- utf8_decode is a shortcut for iconv() using the 2 most common encodings, just FYI
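
To illustrate the ENT_QUOTES point, a short sketch comparing the two flags. The flags and the encoding are passed explicitly because the defaults have differed between PHP versions; the sample value is made up:

```php
<?php
$value = "O'Reilly \"quoted\"";

// ENT_COMPAT escapes double quotes only; the single quote survives.
$compat = htmlentities($value, ENT_COMPAT, 'UTF-8');

// ENT_QUOTES escapes both kinds of quote.
$quotes = htmlentities($value, ENT_QUOTES, 'UTF-8');

// In a single-quoted attribute, the ENT_COMPAT output ends the attribute
// early at the apostrophe; the ENT_QUOTES output does not.
echo "<input value='" . $compat . "'>\n";
echo "<input value='" . $quotes . "'>\n";
```

So ENT_QUOTES can be skipped only if every attribute in every template is double-quoted, which is why passing it everywhere is the safer habit.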

Edit: formatting