| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by beagle3 3645 days ago

(based on discussion, can't get to website)

Any discussion that does not compare to LuaJIT2 is suspect in its conclusions. On the surface, Lua is almost as dynamic as Python, and LuaJIT2 is able to do wonders with it.

Part of the problem with Python is (that Lua doesn't share) is that things you use all the time can potentially shift between two loop iterations, e.g.

    for x in os.listdir(PATH):
       y = os.path.join(PATH, x)
           process_file(y)

There is no guarantee that "os.path" (and thus "os.path.join") called by one iteration is the same one called by the next iteration - process_file() might have modified it.

It used to be common practice to cache useful routines (e.g. start with "os_path_join = os.path.join" before the loop and call "os_path_join" instead of "os.path.join"), thus avoiding the iterative lookup on each iteration. I'm not sure why it isn't anymore - it would also likely help PyPy and Pyston produce better code.

This is by no means the only thing that makes Python slower - my point is that if one looks for inherently slow issues that cannot be solved by an implementation, comparison with Lua/LuaJIT2 is a good way to go.

6 comments

dagw 3645 days ago

It used to be common practice to cache useful routines (e.g. start with "os_path_join = os.path.join" before the loop and call "os_path_join" instead of "os.path.join"), thus avoiding the iterative lookup on each iteration.

I'll admit to being skeptical to how big an effect this would have, but it was bigger than I thought:

  %%timeit
  s=0
  for i in itertools.repeat(1.0, 1000000):
      s+=math.sin(i)

  10 loops, best of 3: 124 ms per loop

vs

  %%timeit
  s=0
  math_sin=math.sin
  for i in itertools.repeat(1.0, 1000000):
      s+=math_sin(i)

  10 loops, best of 3: 89.1 ms per loop

fovc 3645 days ago

Not sure how this timeit is running, but there are two things going on: Local name lookups are done as array indexing; whereas global lookups are done as cascading dictionary lookups. And the attribute lookup also is a dictionary search. On instances, looking up a method defined in a super class involves failed lookups in the instance and all the base classes in a definition. In a hot inner loop, definitely worth cutting out

bluejekyll 3645 days ago

But why does the programmer need to do this? Should the language interpreter be able to implicitly perform this operation?

Isn't that what the .pyc files are for, so that it doesn't need to perform lookups like this at runtime?

beagle3 3645 days ago

There is a difference in semantics. Someone else might actually want os.path to change between call. The only problem is that the behavior of the idiomatic version is hard (almost impossible) to optimize, even though the more optimizable semantics are actually what most users want.

The language was designed with expressiveness in mind, and it often comes at the expense of speed. Lua and Nim seem to strike a much better balance, and even JavaScript if you avoid the performance killers like "with".

cwyers 3645 days ago

> Someone else might actually want os.path to change between call.

Who? Why?

fovc 3645 days ago

The os.path is not a great example, but imagine a loop where inside the body you mutate some state of an instance and then directly access it. Compare the following with the os.path example:

  for animal in circus.animals:
      circus.next_free_clown.assign(animal)
      # circus.next_free_clown changes in every iteration

zardeh 3645 days ago

Monkeypatching for debugging, mocking, etc.

Most of those are rare, but feasible. Monkeypatching a global as a side effect even rarer, but I think I've done it at some point.

dr_zoidberg 3645 days ago

No, the .pyc files are just the source code translated to bytecode when you import the .py file (notice that running with "python somefile.py" doesn't generate a somefile.pyc, the .pyc is created when you import, from inside python or another script).

.pyc files are created for "faster importing", and they remove having to parse and "compile to bytecode" the .py files.

qihqi 3645 days ago

The exact issue is that it cannot. The reason is variables can change on the runtime, and often dependant on input. Say I can have : if blah: os.path.join = lambda ...

and then the pre-looked-up version of method are now wrong.

trishume 3645 days ago

Except that Lua has the almost the same problem, and Mike Pall solved it in LuaJIT. So did the Truffle/Graal team with JRuby+Truffle.

Lua has the same property that global methods are hash table lookups. What LuaJIT does is that it specializes hash table lookups to the key so that looking up an item in a hash table is two instructions which can be executed at the same time on a super-scalar CPU.

LuaJIT also gets serious wins from lifting the lookups out of loops since they can't change. This is only due to the fact that Lua doesn't have threads that can muck with each other's state. JRuby+Truffle solves this with multithreading by assuming that it doesn't change and deoptimizing into a slow path if it detects a change.

bakery2k 3637 days ago

>> Part of the problem with Python is (that Lua doesn't share) is that things you use all the time can potentially shift between two loop iterations, e.g.

    for x in os.listdir(PATH):
        y = os.path.join(PATH, x)
        process_file(y)

>> There is no guarantee that "os.path" (and thus "os.path.join") called by one iteration is the same one called by the next iteration - process_file() might have modified it.

> LuaJIT also gets serious wins from lifting the lookups out of loops since they can't change. This is only due to the fact that Lua doesn't have threads that can muck with each other's state.

Would Lua code equivalent to the above Python have the same problem - that "process_file" could modify the value of "os.path"? Would this prevent LuaJIT from hoisting the lookup out of the loop, even in the absence of threads?

trishume 3645 days ago

Idea: I wonder if it would be profitable to have a mode for Pyston/PyPy/ZiPy/Whatever that disables threads, or perhaps an annotation that stops other threads from being scheduled inside certain loops.

This would allow so many checks to be hoisted out of the loop like LuaJIT does, that otherwise can't be hoisted because in theory some thread could re-assign os.path in the middle of it. If this could provide a 2-5x speedup, in most scenarios it would outweigh the performance benefit of parallelism, as well as being easier than paralellizing.

xamuel 3645 days ago

There isn't even any guarantee that "os.path" is the vanilla iterative lookup it's presented as. For all we know, it could be a @property-decorated method thousands of lines long.

beagle3 3645 days ago

But as long as you know it is the same thing, you could move ("hoist") it outside the loop. In Lua, you generally can (and LuaJIT2 does). In Python, rarely if ever.

smhx 3645 days ago

and that's when you dont do the optimization.

bakery2k 3637 days ago

Part of the problem with Python is (that Lua doesn't share) is that things you use all the time can potentially shift between two loop iterations

Why is this problem not shared by Lua? Does Lua somehow prevent "process_file" from modifying the equivalent of "os.path"?

jblow 3645 days ago

Due to aliasing problems, you have exactly this issue in C and C++ as well, yet those languages are both massively faster than the languages in question.

kayamon 3645 days ago

While true, an aliasing 'miss' in C just causes a pointer reload, while the same miss in Python causes a hash table lookup.

beagle3 3645 days ago

it causes much more than a hash table lookup: it causes an attribute lookup, which is ...

     1. an instance __dict__ lookup, 
     2. a __getattr__ call (itself involving another hash lookup)
     3. a class property lookup
     4. superclass property lookups (if there are any superclasses)

If it was just a hash table lookup, it would be LuaJIT2 fast. But it's more like 10 hash table lookups and indirections, and that's assuming you don't actually try to hook any of this machinery to do something.

pcwalton 3645 days ago

A C compiler doesn't have to assume that the definitions of functions called directly can change. LLVM couldn't optimize calls to memcpy if it couldn't assume its semantics, just to name one example.