| HN Mirror


	by simias 5143 days ago
	But, why should the locale change the way PHP code is interpreted? Shouldn't LC_ALL="C" when parsing the code? Maybe it breaks if you embed unicode strings or something. What do other languages do?

3 comments

shuzchen 5143 days ago

If it wasn't clear from the comments on the bug report or from the quoted sections of this comment's parent, let me rephrase it. This issue is entirely caused by the fact that PHP is case insensitive for class and function names (but not variables, go figure). That is, if you define a class MyClass, you can instantiate it using MyClass or myclass or MYCLASS. You can also call standard library functions in whatever case you like (so array_map and ARRAY_MAP are both fine).

Based on the behavior of this bug, it appears that PHP handles this case insensitivity by simply lowercasing all class and function names before resolving them. The bug shows up for Turkish in particular because, under the Turkish locale, 'i' is not the lowercase equivalent of 'I' (that would be the dotless 'ı').
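To make the Turkish case concrete, here is a rough Python sketch rather than PHP. Python's own str.lower() ignores the locale, so both `c_lower` and `turkish_lower` are hypothetical helpers that model the two sets of rules by hand; they are not anything PHP actually calls.

```python
import string

# Hypothetical helpers: the two lowercasing rules are modelled by hand,
# because Python's str.lower() does not depend on the process locale.
ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def c_lower(name: str) -> str:
    """ASCII-only lowercasing, roughly what the C locale does."""
    return name.translate(ASCII_LOWER)


def turkish_lower(name: str) -> str:
    """Lowercasing with the Turkish dotted/dotless i rules."""
    # 'I' (U+0049) lowers to dotless 'ı' (U+0131); 'İ' (U+0130) lowers to 'i'.
    return name.replace("I", "\u0131").replace("\u0130", "i").lower()


print(c_lower("INFO"))        # info
print(turkish_lower("INFO"))  # ınfo
```

The two results differ in their first character, so a name stored under one rule cannot be found under the other.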

Pretty much all other modern languages are case sensitive, so I'd be surprised to find this issue elsewhere.

klmr 5143 days ago

Well, VB.NET is case insensitive yet the problem doesn’t crop up there because it’s not braindead enough to use the same locale while compiling & executing. Yes, I get that PHP code isn’t compiled in a separate step but there still is no reason for it to use a user-defined locale. It should use the C locale, end of story. I don’t understand why this isn’t trivial to fix. Is there any place where PHP depends on a user-defined locale for parsing?

EDIT: “trivial to fix” as in, doesn’t cause regression, not necessarily that it’s a small change to the code base.

kalleboo 5143 days ago

Class names can crop up during execution as well though. This is valid PHP:

  $classname = $row_I_got_from_mysql['classname'];
  $object = new $classname;

I'm sure this can still be solved though. It's not trivial, but it's not "takes over 9 years to fix" complex either.

rb12345 5143 days ago

PHP could just use the approach NTFS uses on Windows and convert to upper case instead:

http://blogs.msdn.com/b/michkap/archive/2004/12/02/273619.as...

Xylakant 5143 days ago

Which would not help in this case, since in Turkish the upper-case representation of `i` is not `I` but a different symbol (the dotted `İ`). So the class you're looking for would still not exist.
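A quick sketch of why uppercasing runs into the same wall, again in Python with hand-written rules. `c_upper` and `turkish_upper` are hypothetical helpers for illustration only.

```python
import string

# Hypothetical helpers modelling the two uppercasing rules by hand.
ASCII_UPPER = str.maketrans(string.ascii_lowercase, string.ascii_uppercase)


def c_upper(name: str) -> str:
    """ASCII-only uppercasing, roughly what the C locale does."""
    return name.translate(ASCII_UPPER)


def turkish_upper(name: str) -> str:
    """Uppercasing with the Turkish dotted/dotless i rules."""
    # 'i' uppercases to dotted 'İ' (U+0130); dotless 'ı' (U+0131) to 'I'.
    return name.replace("i", "\u0130").replace("\u0131", "I").upper()


print(c_upper("info"))        # INFO
print(turkish_upper("info"))  # İNFO
```

Whichever direction you normalize in, the dotted and dotless pairs disagree with the ASCII mapping.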

rb12345 5142 days ago

Oops - looks like you're actually right there. For some reason, I thought I'd read that i and ı were both mapped to I.

pieter 5143 days ago

That still doesn't make sense. If 'i' is not the lowercase equivalent of 'I', then the lowercasing should just result in another letter, right? The only thing that could cause the bug is if it uses two different ways of lowercasing (perhaps one when registering the class, and another way when looking up the class).

The mapping between uppercase and lowercase can be completely arbitrary, and as long as it's used consistently you shouldn't get this kind of bug.

zemo 5143 days ago

It's really not that simple. Check the "Fold Case" section of the Letter Case article on Wikipedia; the explanation is much better:

http://en.wikipedia.org/wiki/Letter_case#Unicode_case_foldin...

It's not PHP's fault that accurately performing case transformations across locales is difficult; it's just actually very difficult. The solution isn't to "fix" the process of transforming letter case; the solution is to simply not transform the names of your identifiers. Unfortunately that is simple only in a very isolated setting; in the real world, doing such a thing is liable to break a lot of software.

This is a really good example of the problem at hand:

>The Greek letter Σ has two different lowercase forms: "ς" in word-final position and "σ" elsewhere.
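For what it's worth, the sigma case is easy to poke at in Python 3, whose str.lower() applies the final-sigma rule while str.casefold() maps both lowercase forms to "σ". This only illustrates the Unicode behavior; it says nothing about what PHP does internally.

```python
word = "ΟΔΟΣ"  # Greek for "street", written in capitals

lowered = word.lower()     # context-sensitive: word-final sigma becomes "ς"
folded = word.casefold()   # case folding: every sigma becomes "σ"

print(lowered)  # οδος
print(folded)   # οδοσ

# The two results differ, so lower() alone is not a safe comparison key.
print(lowered == folded)                        # False
print(lowered.casefold() == folded.casefold())  # True
```

That is the reason case folding exists as a separate operation from lowercasing.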

The identifiers are lowercased multiple times: first at parse time, presumably using the locale of the OS, some setting in php.ini, or some fixed locale. (It doesn't, in practice, matter where this initial locale is set; it just matters that it's set at parse time.) They are then lowercased again at runtime; if the locale was changed at runtime such that the casing rules in the two locales produce any differences, the identifier will not be found.
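Here is a minimal sketch of that two-step lowercasing, assuming the class table is just a dictionary keyed by the lowercased name. `ClassTable`, `c_lower` and `turkish_lower` are made-up names for illustration and do not reflect PHP's actual internals.

```python
import string

ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def c_lower(name):
    """ASCII-only lowercasing, roughly what the C locale does."""
    return name.translate(ASCII_LOWER)


def turkish_lower(name):
    """Hand-written model of Turkish lowercasing ('I' becomes dotless 'ı')."""
    return name.replace("I", "\u0131").replace("\u0130", "i").lower()


class ClassTable:
    """Hypothetical class table keyed by the lowercased class name."""

    def __init__(self):
        self.lower = c_lower  # lowercasing rule currently in effect
        self._classes = {}

    def register(self, name):
        self._classes[self.lower(name)] = name

    def lookup(self, name):
        return self._classes.get(self.lower(name))


table = ClassTable()
table.register("Info")            # "parse time": stored under key "info"
print(table.lookup("INFO"))       # Info

table.lower = turkish_lower       # stands in for a setlocale() call at runtime
print(table.lookup("Info"))       # None: the key is now "ınfo"
print(table.lookup("info"))       # Info: no capital I, so the key still matches
```

Even the exact declared spelling fails after the switch, because the stored key and the lookup key were produced by different rules.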

I'm not saying this to defend PHP; just to shed some light on the case-folding problem. Having case-insensitive identifiers is a design mistake.

bnr 5143 days ago

The issue only occurs when the locale is changed between registering and looking up the class.

pieter 5143 days ago

Doesn't look like that from the bug report; there the locale is set first, then the class is defined and then looked up.

bnr 5143 days ago

PHP registers classes (and functions) at parse time, not at execution time.

i.e. this will print "bar":

    <?php
    echo foo();
    function foo() { return "bar"; }

LawnGnome 5143 days ago

Unfortunately, the obvious answer (parse code in the C locale) breaks code that's in the wild and relies on PHP's undocumented locale-specific case-insensitivity.

Obviously, case-insensitive identifiers are a bad idea, but PHP is stuck with them at this point.

Tloewald 5143 days ago

I don't think it's "obvious" at all. The problems caused by case sensitive identifiers are legion, especially in dynamic languages with implicit declaration. It's not at all clear to me that this problem is inherent to case insensitive identifiers rather than simply a flaw in PHP's implementation.

Udo 5143 days ago

Changing the case insensitivity of PHP identifiers is a good idea. I suspect breakage would be minimal and easily fixable. One has to keep in mind that variable names are already case sensitive in PHP, and most (reasonably good) code I've seen in the wild does honor the spelling of class and method names.

Of course it will never happen, but I really think this would be a great idea for the next major version. Backwards compatibility is important but not at all costs. A language needs to remain agile enough to allow for the recognition of (and ultimately the fixing of) mistakes.

Xylakant 5143 days ago

I doubt that breakage would be minimal. I'm not as certain as you that most code does honor the spelling of class and method names, but even if we assume that this is the case I'd assume that there are tons of undetected errors. Currently, there's just no way to test that you're using the proper spelling, so nobody does.

Udo 5143 days ago

It would be trivial to write a command line tool to check (and maybe automatically fix) those typos. Incorporating it into a major new version also ensures everybody has enough time to prepare - and in case of hopeless and unfixable legacy apps: there is always the option not to upgrade.

Breaking changes in programming languages are not that uncommon, C# and Perl spring to mind from personal experience, but also to a lesser degree such things have happened with PHP itself (and it became better for it). In this case, it's actually a change that moves the runtime's behavior closer to what's expected. It's a change that improves internal consistency while also eliminating silly bugs like the one discussed here.

Xylakant 5143 days ago

I'm not against changing that behavior since it's arguably stupid and inconsistent [1], but I'm wary of "trivial" changes. It's not only apps that need fixing; pretty much every library needs checking (and maybe fixing). This cannot be done with a command line check, since class names can be constructed on the fly, called via eval() or call_user_func() etc. Class names may be loaded or even defined on the fly (the SOAP PEAR extension does this to create proxies). All those cases can only be checked by executing the program. It's probably a good change, but this is anything but trivial.

[1] actually, I don't care at all since I moved on to greener pastures.

acdha 5142 days ago

Having seen the hairballs of creative PHP in the wild, I'd think the only sane way to do this would be some sort of deprecation warning whenever a symbol lookup matched only case insensitively.
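Something like the following is what I have in mind, sketched in Python. `SymbolTable` is a hypothetical stand-in for the runtime's lookup, and it uses plain str.lower() for the fallback match, which is an assumption of the sketch.

```python
import warnings


class SymbolTable:
    """Hypothetical symbol table that warns on case-insensitive-only matches."""

    def __init__(self):
        self._exact = {}   # declared name -> value
        self._folded = {}  # lowercased name -> declared name

    def define(self, name, value):
        self._exact[name] = value
        self._folded[name.lower()] = name

    def lookup(self, name):
        # Exact spelling: no warning.
        if name in self._exact:
            return self._exact[name]
        # Fall back to the legacy case-insensitive match, but complain.
        declared = self._folded.get(name.lower())
        if declared is None:
            raise KeyError(name)
        warnings.warn(
            f"'{name}' matches '{declared}' only case-insensitively",
            DeprecationWarning,
            stacklevel=2,
        )
        return self._exact[declared]


symbols = SymbolTable()
symbols.define("MyClass", object)
symbols.lookup("MyClass")  # exact match, silent
symbols.lookup("myclass")  # still resolves, but emits a DeprecationWarning
```

Because the warning fires at lookup time, it also covers names built at runtime that a static checker would never see.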

Xylakant 5142 days ago

Like the ones that they introduced to fix array[key_without_quotes] where key_without_quotes was mapped to a string if it was not a defined constant and a NOTICE was issued? The first thing everyone did was turn off E_NOTICE since practically all code emitted that notice. It took years until you could run apps with E_NOTICE turned on :)

nikic 5143 days ago

Just using LC_ALL="C" would break people using some other language to write code. I personally strongly disagree with writing code in anything but English, but other people think it's okay to have Russian class names or something. Using LC_ALL="C" would make this impossible.