by mjs7231 1888 days ago
Shouldn't this be considered a bug in Python? Why does it even try to evaluate 0xfor without the space? Trying a few other things..

* 0xfor1 evaluates.

* 1or 2 evaluates.

* 1or2 doesn't.

* ''or'foo' evaluates.

This is gross.


That’s the normal way lexers work, given “tight” token definitions. They continue adding to the current token until an invalid (for the current token type) character is reached, and then begin parsing a new token starting with the “invalid” (but now valid for the next token) character (or the next non-whitespace character).

“1or2” is lexed into “1” (integer) followed by “or2” (identifier), which is valid on the lexer level but then fails on the grammar level.
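
To make the "keep extending the current token until a character doesn't fit" behaviour concrete, here is a minimal sketch of such a lexer. The token patterns, the `KEYWORDS` set, and the name `toy_lex` are assumptions made up for illustration; this is not CPython's actual tokenizer.

```python
import re

# Toy token patterns, tried in order. These are assumptions for illustration,
# not CPython's real token definitions.
TOKEN_SPEC = [
    ("HEX", r"0[xX][0-9a-fA-F]+"),
    ("INT", r"[0-9]+"),
    ("NAME", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("STRING", r"'[^']*'|\"[^\"]*\""),
    ("OP", r"[+\-*/]"),
    ("WS", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))
KEYWORDS = {"or", "and", "for"}  # hypothetical, deliberately tiny keyword set


def toy_lex(source):
    """Split source into (kind, text) pairs, extending each token as far as it goes."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            raise ValueError(f"unexpected character {source[pos]!r} at {pos}")
        kind, text = match.lastgroup, match.group()
        if kind == "NAME" and text in KEYWORDS:
            kind = "KEYWORD"
        if kind != "WS":  # whitespace only separates tokens; it is not emitted
            tokens.append((kind, text))
        pos = match.end()
    return tokens


print(toy_lex("0xfor 1"))  # [('HEX', '0xf'), ('KEYWORD', 'or'), ('INT', '1')]
print(toy_lex("1or2"))     # [('INT', '1'), ('NAME', 'or2')]
```

Both inputs lex without complaint in this sketch; "1or2" only becomes an error later, when a grammar sees an integer followed directly by an identifier.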

The lexer unfortunately is a greedy token matcher. As soon as 0xf "made sense" to it and 0xfo did not, it did the same thing it would do with something like 0xf+3, except that the + was an `or` in this case, which is kosher. There is an idempotent step you can take where extra spaces are added before the AST is formed to make this sort of thing easier. The good news is that with a decent lint/format flow, these sorts of things are easy to catch.
Probably a lexer bug. "foo"or should never be processed as "foo" and token OR.
Why not? "or" is an operator, like "+", and "foo"+"bar" should be valid. Why have a special, inconsistent case for "or"?
You know what? You’re right.

I guess it’s an operator token after all.

Not by design, fortunately: https://bugs.python.org/issue43833
That hasn't been confirmed. Are we sure that it's not an inherent ambiguity in the grammar?
That's not totally clear. A bug being filed doesn't mean it's accepted. And this has been (ab)used for quite some time in various Python code golf answers. See https://codegolf.stackexchange.com/a/56 from 2011.
It is 100% by design. It's even documented.

https://docs.python.org/3/reference/lexical_analysis.html#wh...

This is not a bug.

The cited documentation says:

> Whitespace is needed between two tokens only if their concatenation could otherwise be interpreted as a different token (e.g., ab is one token, but a b is two tokens).

The two tokens in this case are "0xf" and "or". Their concatenation cannot be interpreted as a different token, because "0xfor" is not a valid token. Therefore, if I'm reading the rule correctly, whitespace is needed in this case.

"0xffor" is another interesting case. It's also not a valid token, but it could be interpreted as two tokens in two different ways: "0xf" "for" or "0xff" "or". (Python does the latter. I presume it uses something like C's "maximal munch" rule.)

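The two readings of "0xffor" can be enumerated mechanically. The sketch below assumes two toy patterns (a hex literal and an identifier) and uses hypothetical helper names; it only illustrates the "longest first token wins" idea and is not how CPython is implemented.

```python
import re

# Toy patterns (assumptions for illustration, not CPython's definitions).
HEX = re.compile(r"0[xX][0-9a-fA-F]+")
NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def hex_then_name_splits(source):
    """Every way to cut source into a hex literal followed by a name."""
    return [
        (source[:cut], source[cut:])
        for cut in range(1, len(source))
        if HEX.fullmatch(source[:cut]) and NAME.fullmatch(source[cut:])
    ]


def maximal_munch_choice(source):
    """Among the candidate splits, pick the one with the longest first token."""
    return max(hex_then_name_splits(source), key=lambda pair: len(pair[0]))


print(hex_then_name_splits("0xffor"))  # [('0xf', 'for'), ('0xff', 'or')]
print(maximal_munch_choice("0xffor"))  # ('0xff', 'or')
```
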
Your conclusion is the exact opposite of what the documentation explicitly states.

> Whitespace is needed between two tokens only if their concatenation could otherwise be interpreted as a different token (e.g., ab is one token, but a b is two tokens).

Because the concatenation of "0xf" and "or" can't be interpreted as a different token, the whitespace is not needed.
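
A literal reading of that quoted sentence can be written as a one-line check: whitespace is needed only when the two tokens glued together would themselves form a single token. The `TOKEN` pattern and the function name below are assumptions for illustration and cover only hex literals, integers and identifiers.

```python
import re

# Toy definition of "a single token" (an assumption, far smaller than Python's).
TOKEN = re.compile(r"0[xX][0-9a-fA-F]+|[0-9]+|[A-Za-z_][A-Za-z0-9_]*")


def whitespace_needed(left, right):
    """True if left + right would read as one token, so a separator is required."""
    return TOKEN.fullmatch(left + right) is not None


print(whitespace_needed("a", "b"))     # True: "ab" is one identifier
print(whitespace_needed("0xf", "or"))  # False: "0xfor" is not a single token
```

Under this reading "0xf" and "for" also need no whitespace, even though "0xffor" then lexes as "0xff" "or", so the rule says when a space is optional, not which split you get.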

You're right, and I was wrong.

I dislike the rule, and I strongly think that "0xfor" should require whitespace between "0xf" and "or" (I'm sure that influenced my reading), but you're right about what the rule says.

(Apparently I can't edit my previous comment.)

It is not strictly speaking a bug, since it works as intended. But it is clearly a counter-intuitive behavior and could be improved. Making 0xfor a syntax error would definitely be an improvement.

But requiring whitespace between all tokens is not an acceptable solution, since "2+2" should work. Always requiring whitespace between alphanumerical characters in different tokens would make sense.
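
As a rough sketch of that proposal (the function name is hypothetical, and "alphanumerical" is taken to mean `str.isalnum`, which is an assumption): two adjacent tokens would need a space only when the characters that touch are both alphanumeric.

```python
def needs_space_under_proposal(left, right):
    """True if the last char of left and the first char of right are both alphanumeric."""
    return left[-1].isalnum() and right[0].isalnum()


print(needs_space_under_proposal("0xf", "or"))  # True: "0xfor" would be rejected
print(needs_space_under_proposal("2", "+"))     # False: "2+2" still works
print(needs_space_under_proposal("''", "or"))   # False: ''or'foo' would still be allowed
```

Note that quotes are not alphanumeric, so this particular rule would leave the ''or'foo' case from the top of the thread untouched.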