by Mithrandir 5095 days ago
I think this was a good explanation:

"No, the problem results because lowercase i (in most languages) and uppercase I (in most languages) are not actually considered to be the upper/lower variant of the same letter in Turkish. In Turkish, the undotted ı is the lowercase of I, and the dotted İ is the uppercase of i. If you have a class named Image, it will break if the locale is changed to Turkish, because the class_exists() function uses zend_str_tolower() and changes the case of all class names, since they are supposed to be case insensitive. Someone else above explained it very well:

"class_exists() function uses zend_str_tolower(). zend_str_tolower() uses zend_tolower(). zend_tolower() uses _tolower_l() on Windows and tolower() on other OSes. _tolower_l() is not locale aware. tolower() is LC_CTYPE aware."
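
To make the quoted explanation concrete, here is a minimal Python sketch of the two lowercasing rules. It is a simulation, not PHP's code: `turkish_lower` and `TURKISH_LOWER` are hypothetical names that hard-code the dotted/dotless i rule instead of consulting a real locale.

```python
# Hypothetical table: the two letters whose lowercase differs in Turkish.
TURKISH_LOWER = {"I": "ı", "İ": "i"}


def turkish_lower(text: str) -> str:
    """Lowercase text with the Turkish dotted/dotless i rule applied."""
    return "".join(TURKISH_LOWER.get(ch, ch.lower()) for ch in text)


def default_lower(text: str) -> str:
    """Locale-independent lowercasing (Python's str.lower ignores the locale)."""
    return text.lower()


class_name = "Image"
print(default_lower(class_name))   # image
print(turkish_lower(class_name))   # ımage
# The two results differ, so a lookup keyed on one will miss the other.
print(default_lower(class_name) == turkish_lower(class_name))  # False
```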

Edit: Someone else later said the following (I'm wondering if it's true):

"This, practically, can't be fixed. Mainly because there's no way to know if 'I' is uppercase of 'i' or 'ı' since there's not a separate place for Turkish 'I' in code tables. The same holds for 'i' (can't be known if it's lowercase of 'I' or 'İ'). I told 2 years ago and will say it again: PHP should provide a way to turn off case-insensitive function/class name lookup. No good programmer uses this Basic language feature since identifiers are case-sensitive in all real languages like Python, Ruby, C#, Java."


But, why should the locale change the way PHP code is interpreted? Shouldn't LC_ALL="C" when parsing the code?

Maybe it breaks if you embed Unicode strings or something. What do other languages do?

If it wasn't clear from the comments on the bug report or from the quoted sections of this comment's parent, let me rephrase it. This issue is entirely caused by the fact that PHP is case insensitive for class and function names (but not variables, go figure). That is, if you define a class MyClass, you can instantiate it using MyClass or myclass or MYCLASS. You can also call the functions from the standard library in whatever case you like (so, array_map or ARRAY_MAP is fine).

Based on the behavior of this bug, it appears that the way PHP handles this case insensitivity is that it just lowercases all class and function names before resolving them. And this bug in particular shows up for Turkish because 'i' is not the lowercase equivalent of 'I'.
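
A rough sketch of that "lowercase everything before resolving" approach, in Python. This is an assumption about the general technique, not PHP's actual implementation; `SymbolTable` and its methods are made-up names.

```python
class SymbolTable:
    """Hypothetical case-insensitive table: every key is stored lowercased."""

    def __init__(self, lower=str.lower):
        self._lower = lower      # the lowercasing rule used for all keys
        self._symbols = {}

    def register(self, name, value):
        self._symbols[self._lower(name)] = value

    def lookup(self, name):
        # Returns None when no symbol matches the lowercased name.
        return self._symbols.get(self._lower(name))


table = SymbolTable()
table.register("MyClass", "<class MyClass>")
for spelling in ("MyClass", "myclass", "MYCLASS"):
    print(spelling, "->", table.lookup(spelling))
```

All three spellings resolve to the same entry because both `register` and `lookup` pass the name through the same lowercasing function.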

Pretty much all other modern languages are case sensitive, so I'd be surprised to find this issue elsewhere.

Well, VB.NET is case insensitive yet the problem doesn’t crop up there because it’s not braindead enough to use the same locale while compiling & executing. Yes, I get that PHP code isn’t compiled in a separate step but there still is no reason for it to use a user-defined locale. It should use the C locale, end of story. I don’t understand why this isn’t trivial to fix. Is there any place where PHP depends on a user-defined locale for parsing?

EDIT: “trivial to fix” as in, doesn’t cause regression, not necessarily that it’s a small change to the code base.
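
For reference, this is roughly what "use the C locale" would mean for identifier lookup: in the C locale only A-Z are lowercased and every other character is left alone. A small Python sketch of that rule (`ascii_lower` is a hypothetical helper, not a PHP function):

```python
import string

# Translation table that touches only the 26 ASCII uppercase letters.
_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def ascii_lower(name: str) -> str:
    """Lowercase A-Z only; all other characters pass through unchanged."""
    return name.translate(_ASCII_LOWER)


print(ascii_lower("Image"))   # image
print(ascii_lower("İmage"))   # İmage (the dotted capital is not ASCII)
print(ascii_lower("Класс"))   # Класс (non-ASCII names keep their case)
```

The result for "Image" no longer depends on any locale setting. The trade-off is that non-ASCII identifiers are effectively matched case-sensitively under this rule.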

Class names can crop up during execution as well though. This is valid PHP:

  $classname = $row_I_got_from_mysql['classname'];
  $object = new $classname;
I'm sure this can still be solved though. It's not trivial, but it's not "takes over 9 years to fix" complex either.
PHP could just use the approach NTFS uses on Windows and convert to upper case instead:

http://blogs.msdn.com/b/michkap/archive/2004/12/02/273619.as...

Which would not help in this case, since in Turkish the upper-case representation of `i` is not `I` but a different symbol. So the class you're looking for would not exist.
That still doesn't make sense. If 'i' is not the lowercase equivalent of 'I', then the lowercasing should just result in another letter, right? The only thing that could cause the bug is if it uses two different ways of lowercasing (perhaps one when registering the class, and another way when looking up the class).

The mapping between uppercase and lowercase can be completely arbitrary, and as long as it's used consistently you shouldn't get these kinds of bugs.

It's really not that simple. Check the "Fold Case" section of the Letter Case article on Wikipedia; the explanation is much better:

http://en.wikipedia.org/wiki/Letter_case#Unicode_case_foldin...

It's not PHP's fault that accurately performing case transformations across locales is difficult; it's just actually very difficult. The solution isn't to "fix" the process of transforming letter case; the solution is to simply not transform the names of your identifiers. Unfortunately that is simple only in a very isolated setting; in the real world, doing such a thing is liable to break a lot of software.

This is a really good example of the problem at hand:

>The Greek letter Σ has two different lowercase forms: "ς" in word-final position and "σ" elsewhere.
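
The sigma case can be checked in Python, whose `str.lower()` applies the Unicode final-sigma rule and whose `str.casefold()` folds both lowercase forms to one. This only illustrates the quoted Unicode behavior; it says nothing about what PHP does.

```python
word = "ΟΔΟΣ"

lowered = word.lower()
print(lowered)                    # οδος  (word-final Σ becomes ς)
print("Σ".lower())                # σ     (a lone Σ becomes σ)

# Lowercasing is context dependent, so it is not a per-character mapping.
print(lowered[-1] == "Σ".lower())  # False

# Case folding exists to make comparisons stable: both forms fold to σ.
print("ς".casefold() == "σ".casefold())  # True
```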

The identifiers are lowercased multiple times: first at parse time, presumably using the locale of the OS, some setting in php.ini, or some fixed locale. (It doesn't, in practice, matter where this initial locale is set; it just matters that it's set at parse time.) They are then lowercased again at runtime; if the locale was changed at runtime, such that the casing rules in the two locales produce any differences, the identifier will not be found.

I'm not saying this to defend PHP; just to shed some light on the case-folding problem. Having case-insensitive identifiers is a design mistake.
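
Here is a small Python simulation of that two-phase mismatch. It assumes the registry is a plain dictionary keyed by the lowercased name; `turkish_lower` and `simulate` are hypothetical helpers, and the Turkish rule is hard-coded instead of coming from a real locale.

```python
TURKISH_LOWER = {"I": "ı", "İ": "i"}


def turkish_lower(name: str) -> str:
    """Lowercase with the Turkish dotted/dotless i rule (hand-written)."""
    return "".join(TURKISH_LOWER.get(ch, ch.lower()) for ch in name)


def simulate(register_lower, lookup_lower, name="Image"):
    """Register name under one lowercasing rule, look it up under another."""
    table = {register_lower(name): f"<class {name}>"}   # "parse time"
    return table.get(lookup_lower(name))                # "run time"


print(simulate(str.lower, str.lower))           # <class Image>
print(simulate(turkish_lower, turkish_lower))   # <class Image>
print(simulate(str.lower, turkish_lower))       # None
```

Either rule works when it is used for both phases, which matches the point made upthread about consistency. The lookup only fails when the rule changes between registration and lookup.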

The issue only occurs when the locale is changed between registering and looking up the class.
Doesn't look like that from the bug report; there the locale is set first, then the class is defined and then looked up.
PHP registers classes (and functions) at parse time, not at execution time.

i.e. this will print "bar":

    <?php
    echo foo();
    function foo() { return "bar"; }
Unfortunately, the obvious answer (parse code in the C locale) breaks code that's in the wild and relies on PHP's undocumented locale-specific case-insensitivity.

Obviously, case-insensitive identifiers are a bad idea, but PHP is stuck with them at this point.

I don't think it's "obvious" at all. The problems caused by case-sensitive identifiers are legion, especially in dynamic languages with implicit declaration. It's not at all clear to me that this problem lies in case-insensitive identifiers as such rather than simply in PHP's implementation of them.
Changing the case insensitivity of PHP identifiers is a good idea. I suspect breakage would be minimal and easily fixable. One has to keep in mind that variable names are already case sensitive in PHP, and most (reasonably good) code I've seen in the wild does honor the spelling of class and method names.

Of course it will never happen, but I really think this would be a great idea for the next major version. Backwards compatibility is important but not at all costs. A language needs to remain agile enough to allow for the recognition of (and ultimately the fixing of) mistakes.

I doubt that breakage would be minimal. I'm not as certain as you that most code does honor the spelling of class and method names, but even if we assume that this is the case, I'd expect there to be tons of undetected errors. Currently, there's just no way to test that you're using the proper spelling, so nobody does.
It would be trivial to write a command line tool to check (and maybe automatically fix) those typos. Incorporating it into a major new version also ensures everybody has enough time to prepare - and in case of hopeless and unfixable legacy apps: there is always the option not to upgrade.

Breaking changes in programming languages are not that uncommon, C# and Perl spring to mind from personal experience, but also to a lesser degree such things have happened with PHP itself (and it became better for it). In this case, it's actually a change that moves the runtime's behavior closer to what's expected. It's a change that improves internal consistency while also eliminating silly bugs like the one discussed here.

Having seen the hairballs of creative PHP in the wild, I'd think the only sane way to do this would be some sort of deprecation warning whenever a symbol lookup matched only case insensitively.
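
One way such a deprecation path could look, sketched in Python. Everything here is hypothetical (`SymbolTable`, the warning text); it only shows the idea of trying an exact match first and warning when only the case-insensitive match succeeds.

```python
import warnings


class SymbolTable:
    """Hypothetical table: exact lookups are silent, case-only matches warn."""

    def __init__(self):
        self._exact = {}    # declared spelling -> value
        self._folded = {}   # lowercased spelling -> (declared spelling, value)

    def register(self, name, value):
        self._exact[name] = value
        self._folded[name.lower()] = (name, value)

    def lookup(self, name):
        if name in self._exact:
            return self._exact[name]
        declared, value = self._folded[name.lower()]  # KeyError if unknown
        warnings.warn(
            f"'{name}' matched '{declared}' only case-insensitively",
            DeprecationWarning,
            stacklevel=2,
        )
        return value


table = SymbolTable()
table.register("MyClass", object)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    table.lookup("MyClass")   # exact spelling: no warning
    table.lookup("myclass")   # wrong case: still resolves, but warns

print(len(caught))            # 1
print(caught[0].message)
```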
Just using LC_ALL="C" would break people using some other language to write code. I personally strongly disagree with writing code in anything but English, but other people think it's okay to have Russian class names or something. Using LC_ALL="C" would make this impossible.