Hacker News new | ask | show | jobs
by beagle3 3645 days ago
(based on discussion, can't get to website)

Any discussion that does not compare to LuaJIT2 is suspect in its conclusions. On the surface, Lua is almost as dynamic as Python, and LuaJIT2 is able to do wonders with it.

Part of the problem with Python is (that Lua doesn't share) is that things you use all the time can potentially shift between two loop iterations, e.g.

    for x in os.listdir(PATH):
       y = os.path.join(PATH, x)
           process_file(y)
There is no guarantee that "os.path" (and thus "os.path.join") called by one iteration is the same one called by the next iteration - process_file() might have modified it.

It used to be common practice to cache useful routines (e.g. start with "os_path_join = os.path.join" before the loop and call "os_path_join" instead of "os.path.join"), thus avoiding the iterative lookup on each iteration. I'm not sure why it isn't anymore - it would also likely help PyPy and Pyston produce better code.

This is by no means the only thing that makes Python slower - my point is that if one looks for inherently slow issues that cannot be solved by an implementation, comparison with Lua/LuaJIT2 is a good way to go.

6 comments

It used to be common practice to cache useful routines (e.g. start with "os_path_join = os.path.join" before the loop and call "os_path_join" instead of "os.path.join"), thus avoiding the iterative lookup on each iteration.

I'll admit to being skeptical to how big an effect this would have, but it was bigger than I thought:

  %%timeit
  s=0
  for i in itertools.repeat(1.0, 1000000):
      s+=math.sin(i)

  10 loops, best of 3: 124 ms per loop
vs

  %%timeit
  s=0
  math_sin=math.sin
  for i in itertools.repeat(1.0, 1000000):
      s+=math_sin(i)

  10 loops, best of 3: 89.1 ms per loop
Not sure how this timeit is running, but there are two things going on: Local name lookups are done as array indexing; whereas global lookups are done as cascading dictionary lookups. And the attribute lookup also is a dictionary search. On instances, looking up a method defined in a super class involves failed lookups in the instance and all the base classes in a definition. In a hot inner loop, definitely worth cutting out
But why does the programmer need to do this? Should the language interpreter be able to implicitly perform this operation?

Isn't that what the .pyc files are for, so that it doesn't need to perform lookups like this at runtime?

There is a difference in semantics. Someone else might actually want os.path to change between call. The only problem is that the behavior of the idiomatic version is hard (almost impossible) to optimize, even though the more optimizable semantics are actually what most users want.

The language was designed with expressiveness in mind, and it often comes at the expense of speed. Lua and Nim seem to strike a much better balance, and even JavaScript if you avoid the performance killers like "with".

> Someone else might actually want os.path to change between call.

Who? Why?

The os.path is not a great example, but imagine a loop where inside the body you mutate some state of an instance and then directly access it. Compare the following with the os.path example:

  for animal in circus.animals:
      circus.next_free_clown.assign(animal)
      # circus.next_free_clown changes in every iteration
Monkeypatching for debugging, mocking, etc.

Most of those are rare, but feasible. Monkeypatching a global as a side effect even rarer, but I think I've done it at some point.

No, the .pyc files are just the source code translated to bytecode when you import the .py file (notice that running with "python somefile.py" doesn't generate a somefile.pyc, the .pyc is created when you import, from inside python or another script).

.pyc files are created for "faster importing", and they remove having to parse and "compile to bytecode" the .py files.

The exact issue is that it cannot. The reason is variables can change on the runtime, and often dependant on input. Say I can have : if blah: os.path.join = lambda ...

and then the pre-looked-up version of method are now wrong.

Except that Lua has the almost the same problem, and Mike Pall solved it in LuaJIT. So did the Truffle/Graal team with JRuby+Truffle.

Lua has the same property that global methods are hash table lookups. What LuaJIT does is that it specializes hash table lookups to the key so that looking up an item in a hash table is two instructions which can be executed at the same time on a super-scalar CPU.

LuaJIT also gets serious wins from lifting the lookups out of loops since they can't change. This is only due to the fact that Lua doesn't have threads that can muck with each other's state. JRuby+Truffle solves this with multithreading by assuming that it doesn't change and deoptimizing into a slow path if it detects a change.

>> Part of the problem with Python is (that Lua doesn't share) is that things you use all the time can potentially shift between two loop iterations, e.g.

    for x in os.listdir(PATH):
        y = os.path.join(PATH, x)
        process_file(y)
>> There is no guarantee that "os.path" (and thus "os.path.join") called by one iteration is the same one called by the next iteration - process_file() might have modified it.

> LuaJIT also gets serious wins from lifting the lookups out of loops since they can't change. This is only due to the fact that Lua doesn't have threads that can muck with each other's state.

Would Lua code equivalent to the above Python have the same problem - that "process_file" could modify the value of "os.path"? Would this prevent LuaJIT from hoisting the lookup out of the loop, even in the absence of threads?

Idea: I wonder if it would be profitable to have a mode for Pyston/PyPy/ZiPy/Whatever that disables threads, or perhaps an annotation that stops other threads from being scheduled inside certain loops.

This would allow so many checks to be hoisted out of the loop like LuaJIT does, that otherwise can't be hoisted because in theory some thread could re-assign os.path in the middle of it. If this could provide a 2-5x speedup, in most scenarios it would outweigh the performance benefit of parallelism, as well as being easier than paralellizing.

There isn't even any guarantee that "os.path" is the vanilla iterative lookup it's presented as. For all we know, it could be a @property-decorated method thousands of lines long.
But as long as you know it is the same thing, you could move ("hoist") it outside the loop. In Lua, you generally can (and LuaJIT2 does). In Python, rarely if ever.
and that's when you dont do the optimization.
Part of the problem with Python is (that Lua doesn't share) is that things you use all the time can potentially shift between two loop iterations

Why is this problem not shared by Lua? Does Lua somehow prevent "process_file" from modifying the equivalent of "os.path"?

Due to aliasing problems, you have exactly this issue in C and C++ as well, yet those languages are both massively faster than the languages in question.
While true, an aliasing 'miss' in C just causes a pointer reload, while the same miss in Python causes a hash table lookup.
it causes much more than a hash table lookup: it causes an attribute lookup, which is ...

     1. an instance __dict__ lookup, 
     2. a __getattr__ call (itself involving another hash lookup)
     3. a class property lookup
     4. superclass property lookups (if there are any superclasses)
If it was just a hash table lookup, it would be LuaJIT2 fast. But it's more like 10 hash table lookups and indirections, and that's assuming you don't actually try to hook any of this machinery to do something.
A C compiler doesn't have to assume that the definitions of functions called directly can change. LLVM couldn't optimize calls to memcpy if it couldn't assume its semantics, just to name one example.