Hacker News new | ask | show | jobs
by smegel 3643 days ago
> This means that it doesn't really matter how quickly you execute the "Python" part of Python. Another way of saying this is that Python opcodes are very complex, and the cost of executing them dwarfs the cost of dispatching them.

That doesn't really explain why Python is slow. Your just explaining how Python works. Why should C code be slow? Usually it is fast. Just saying the opcodes are complex doesn't really help, because if a complex opcode takes a long time, it is usually because it is doing a great deal.

Java used to have the opposite problem. It was doing too much at the "Java bytecode" level, such as string manipulation - so they added more "complex" opcodes written in C/C++ to speed things up, significantly.

What you really need to explain is why Python is inefficient. Bloated data structures and pointer hopping for simple things like adding two numbers may be a big reason. I know Perl had many efficiencies built in, and was considered quite fast at some point (90s?).

7 comments

> What you really need to explain is why Python is inefficient.

Python is extremely dynamic and this makes things hard for someone who wants to build a JIT.

The powerful bits of python metaprogramming makes it really impossible for a JIT to say with some certainty across all running threads that what it is doing is right.

Inlining a simple call like a.x() is rather hard when everything underneath can move around - I am not saying that it always does, but implementing a python variant which is nearly the same isn't very useful.

Compare this to PHP, which has fixed method calls (unless you use runkit, which you shouldn't) - a->x() will always be the same method as long as there was an a->x which was valid.

The method will never change once it is has been validated.

However unlike Java, both languages end up not knowing exactly what type "a" will be when the method is being called.

Java also doesn't quite know, but only when the invoke is via an interface. But the engine at least knows exactly how many impls of that interface has been loaded so far (and the bi-morphic case is commonly 1 real impl and 1 mock impl).

But both in case of PHP and Python, the whole idea of "which object do I have look up ::x() for?" is an unknown. In PHP's case, you have to look it up once per class encountered and in Python's case, you have to verify someone hasn't replaced it at runtime.

There are very nice functional ways around this problem at the bottom end of numeric loops for Python, which makes it great for numeric processing interleaved with generic control flow.

numpy + numba is a great way of limiting all this and getting performance out of a simple loop. And I'd rather use numpy + a python script doing regexes rather than a C program + LAPACK.

But that performance doesn't translate over when you have class/object oriented structures or in general, just multi-threaded web/rpc style code.

JavaScript has the same problems, and that hasn't stopped every major JS engine from building a JIT.
JS is single-threaded which makes an enormous difference to actually squeezing performance out of your JIT.

Just building a JIT for a language generally isn't the hard part. Building a JIT that is substantially faster than a bytecode-compiled implementation of the language is what's hard, and how hard that is depends intimately on the semantics of the source language. When I say intimately, I mean every single detail of the language's semantics matter.

This article is a follow up to an earlier post (https://blog.pyston.org/2016/06/30/baseline-jit-and-inline-c...) which provides additional context. In short, python opcodes are inefficient because they must support a significant amount of dynamicity, even as opposed to some other language peers like JS, PHP, Lua, etc. The original post explains some of the cool stuff they're doing with Pyston (specifically baseline jit and inline caching) to combat it.
This is also why tensor flow is such a resource hog compared to torch. My mac book pro has intel iris graphics so no cuda, but I was able to get usable results with torch, still struggling to get tensor flow to produce anything useful. Compared to lua python is very hungry, but also Apple is going to need to have a decent GPU option in its new macbook pros to keep people who work on ai projects from bailing for hackingtoshes.
I don't know a ton about low level language implementation, so excuse me if this comment is misguided, but...

Does type hinting in Python 3.x (PEP 484) change this (i.e., reduce the overhead due to dynamicness)? I know it doesn't include runtime type checking, so perhaps it's moot but maybe someone has written an interpreter that does hard type checking either at runtime or through a compilation step.

> Does type hinting in Python 3.x (PEP 484) change this

No. The type hints are only meant for tooling (type checking, refactoring, documentation, etc.).

> Java used to have the opposite problem. It was doing too much at the "Java bytecode" level, such as string manipulation - so they added more "complex" opcodes written in C/C++ to speed things up, significantly.

Where'd you get that idea? It's because of advanced JITs that Java got a lot faster.

- former JVM hacker

This. The article does not explain at all where exactly all the processor cycles are going. It's basically just saying "it's not the Python languages' fault!" but fails to name a specific culprit.

It says it's spending the cycles in the "C Runtime" but what exactly does it (have to) do in the C Runtime that eats up performance?

> It's basically just saying "it's not the Python languages' fault!"

The article is actually saying the exact opposite. It claims Python the Language is slow because the opcodes need to do a lot of work according to the language specification. Python is not slow because the core team has done a poor job implementing the opcode interpreter and runtime.

When you have a language with thin opcodes that map closely to processor instructions then compiler improvements lead to smarter opcode generation which translates to efficient machine code after jitting. When you have fat opcodes you're SOL.

Consider this: an instruction like STORE_ATTR (implements obj.name = value) has to check whether the top of the stack refers to an object, then whether writing to the object attributes is allowed. Then "name" has to be checked if it's a string. Perhaps additional checking is needed to normalize the string or other unicode testing. Then the assignment has to happen which is a dictionary update (internally a hash map insertion which may trigger a resize). This is just the tip of the iceberg. A lot more stuff is happening and the correct exceptions are thrown when the instruction is used incorrectly (which leads to code bloat which hurts the instruction cache).

A thin bytecode instruction for STORE_ATTR could actually reduce the store to a single LEA machine code instruction (Load Effective Address).

The downside of a language with a thin instruction set is that the individual instructions can't validate their input. They have to trust that the compiler did its job correctly, a segfault or memory corruption will happen otherwise. One of Guido's goals when creating Python was that the runtime should never ever crash, even on nonsensical input. This pretty much rules out a thin instruction set right from the start. Simple concurrency is also a lot easier with a fat instruction set, because the interpreter can just yield in between instructions (Global interpreter lock). With a thin instruction set there is no clear delineation between instructions where all memory is guaranteed to be in a stable state. So a different locking model is needed for multi-threading, which adds even more complexity to the compiler and runtime.

All the problems you're describing are solved with a powerful JIT. And the core team do seem to be opposed to doing the work needed for that.
Python's philosophy chooses simplicity over everything else. Simple grammar. Simple bytecode instruction set. Simple CPython implementation. Simple threading model (GIL). Simple data structures. Adding a highly sophisticated and complicated JIT on top of that makes little sense.

It's not so difficult to create a high performance language that's much like Python. It's just not possible to make Python fast without abandoning some of its founding principles.

Why is a simple CPython implementation such an important requirement?

Portability? Make the JIT optional.

Ease of maintenance? Get a small team of experts to maintain it on behalf of everyone else.

Openness to beginners? That would be nice if possible as well, but CPython's job is to run programs rather than to educate.

A JIT needn't make the grammar, bytecode or threading model more complex. It would make data structures and the implementation more complex, but do you not think that's worth it if Python could be twice as fast?

> CPython's job is to run programs rather than to educate.

CPythons 'job' is to be the reference implementation of Python.

To be fair, the GIL wasn't included because it was a simple threading model (AFAIK). It was included because it was simple to implement and it was/is fast(er) (than removing it)[1][2].

If the Gilectomy [2] project succeeds, Guido has mentioned he would consider it for Python3.6+ [3].

[1] http://www.artima.com/weblogs/viewpost.jsp?thread=214235

[2] https://www.youtube.com/watch?v=P3AyI_u66Bw

[3] https://youtu.be/YgtL4S7Hrwo?t=10m59s

Hence why I rather support Julia and leave Python for shell scripting like tasks.
coughsufficiently smart compilercough.
No we have compilers that can do these things today.
A small nit: LEA does the calculation but doesn't read or write from that address. In times of old, this instruction used the memory addressing port to do the calculation, but these days it's just a normal arithmetic instruction with slight difference that it doesn't set the flags based on the result. Instead, the ideal would be for the bytecode to reduce to a single MOV. In addition to loading and storing, MOV itself supports several forms of indexed and indirect address calculation which execute on dedicated ports without adding latency.
Which complex bytecodes did Java introduce?

Since hotspot was introduced,the amount of C++ on the reference JDK has been incrementally reduced between releases,with Hotspot improving its heuristics.

To the point that Graal is a pure Java JIT.

Also in the 80's and early 90's C compilers generated awfully slow code.

C developers have to thank almost 40 years of research in C optimizers and misuse of UB optimizations for the current state of C compilers quality in code generation.

The author means the Hotspot JVM has added intrinsics for memset, autovectorization, etc.

I don't agree with the main point of the article though. Pypy, ZipPy, etc. have shown that there are real gains from running the actual Python code faster.

>>I know Perl had many efficiencies built in, and was considered quite fast at some point (90s?).

There are a lot of threads in Perlmonks that talk in detail about speeding up Perl, related project et al.

To be summarizing it. Languages like Perl and Python are slow because they do a lot of work out of the box that languages like C don't. There fore when you talk of talk of translating Python to C, or Perl to C. Essentially what you are talking of is translating all that extra action back into C, which will run as fast as Perl or Python itself.

The more you make it easy for the compiler to interpret the faster it can run and vice versa.

Python is slow for the very reason its famous, its easy for the programmer.

Lisp and Smalltalk like languages run circles around Python and Perl performance.

Enjoy the same powerful features, have JIT and AOT compilers to native code.

It all boils down to how much the language designers care about performance.

And also how much the language designers care about proper language design.
I'm not sure if this really answers the question though. There are plenty of languages that do more work than C does that are not as slow as python. Do they do less work than python? Maybe, but they certainly do more work than C. The question is, even if they do less work than python, is the extra work python doing valuable to you?
Lua is almost as dynamic and flexible as Python and is very easy for the programmer. LuaJIT's performance is close to that of C.
I think part of the reason for this, though, is that Lua is a very "thin" language. It purposefully doesn't have a lot of fancy features (I mean, come on, it literally has one composite datatype, and you're supposed to use them [tables] for everything from arrays to maps to full objects). By removing a lot of assumptions that make Python "easy", Lua has made it much easier to write optimizing JITers and compilers since runtime behavior is (perhaps a little paradoxically) easier to predict. And it doesn't make Lua any less "easy" a language, it's just a different set of rules to follow. Guido's hunt for the simplest programming language possible is noble, but sometimes it causes poor decisions to be made imho.
Well, just want to say for one example here, why Python is hard to speedup than JS.

In Python, essentially every object is a dictionary. And behavior like, "x.y = 'z'" is legal, meaning pretty much everything could be put into object x without declaration ahead of time, at all. While, not very sure, JS doesn't allow such behavior, you can still do assignment, but assessing will yield undefined.

The above behavior come as default(you can use __slot__ for sure, but compiler can't assume that) is pretty problematic for optimization, compiler simply won't know where to locate a specific variable ahead-of-time, so it cannot substitute the variable access using a pointer offset, it will have to go through a hash lookup, which comparing to the offset alternative, is magnitude slower.

Except that LuaJIT does something close to just a pointer offset. The trick it uses is to use a hash function that is known by the compiler so when you have a constant key ("y" in this case) the hash table set is compiled down to 3 instructions (imagine hash("y") = 0xa7):

1. Load whatever necessary to check if x[0xa7] is occupied 2. Branch to slow path if occupied 3. Store x[0xa7] = 'z'

And a modern super-scalar CPU will even do some of this in parallel.

You can generally do that in JS as well. (Added properties are sometimes called "expandos".)
True. Just find out. But Chrome's console yield undefined to me...Weird.

Edit: find the reason. It is because I am trying to assign a property to int. Seems JS won't allow this to primitive type?

Edit: Python won't allow assignment to primitive type bject either. The handling is the same.