Attempts to make Python fast


	Attempts to make Python fast (sethops1.net)
	135 points by Queue29 2072 days ago

18 comments

the_mitsuhiko 2071 days ago

Python is fundamentally not designed to be faster because it leaks a lot of stuff that’s inherently slow that real world code depends on. That’s mutable interpreter frames, global interpreter locks, shared global state, type slots, the C ABI.

The only way to speed it up would be to change the language.

link

overgard 2071 days ago

I don't think that's really true, those things are definite challenges, but PyPy is still significantly faster than CPython while (afaik) allowing that sort of stuff to go on. If you wanted C/Rust level performance then yeah, you need to redesign the language, but if you just want an interpreter that runs 5-10x faster than what they have now? Both doable and has been done.

link

Rochus 2071 days ago

PyPy is about four times faster than CPython (https://speed.pypy.org/) which is not that much compared to the effort. Node.js is about 13 times faster (https://benchmarksgame-team.pages.debian.net/benchmarksgame/...).

link

nomel 2071 days ago

For a funny perspective, there’s a Python interpreter, written in JS, that is toe to toe with CPython, and faster in many benchmarks: https://brython.info/speed_results.html

link

kzrdude 2064 days ago

Super cool - but how sad that it's only in the browser, not a complete Python implementation. The page forgets to mention if lower or higher is better.

link

overgard 2071 days ago

4 times is a huge boost, and that's on average. For certain operations it's much much faster. Also comparing a volunteer project to an interpreter that has the resources of Google behind it is IMO pretty unfair.

Also saying the effort of PyPy is purely around speed is misleading. After all, another huge goal of the project was to implement a python interpreter in python, which they succeeded at.

link

Rochus 2071 days ago

> 4 times is a huge boost, and that's on average.

It's the geometric mean, not the arithmetic average (see http://ece.uprm.edu/~nayda/Courses/Icom5047F06/Papers/paper4...). The same principle is applied to all implementations under test. It's normal that certain benchmarks run faster than others. That's why we compare geometric means.
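To make the distinction concrete, here is a minimal sketch comparing the two means. The speedup ratios are invented for illustration and are not taken from speed.pypy.org or the benchmarks game.

```python
from statistics import geometric_mean, mean

# Hypothetical per-benchmark speedups (alternative runtime vs. baseline).
# These values are made up purely to illustrate the two means.
speedups = [0.5, 2.0, 4.0, 16.0]

# The arithmetic mean is pulled upward by the single large ratio.
arithmetic = mean(speedups)

# The geometric mean treats "2x faster" and "2x slower" symmetrically,
# which is why it is the usual choice for summarising ratios.
geometric = geometric_mean(speedups)

print(f"arithmetic mean: {arithmetic:.3f}")
print(f"geometric mean:  {geometric:.3f}")
```

With ratios like these the arithmetic mean is roughly twice the geometric mean, so the two summaries are not interchangeable.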

> Also comparing a volunteer project to an interpreter that has the resources of google behind it is IMO pretty unfair.

Didn't the project run for nearly twenty years with several rounds of EU funding? I think it's rather the approach than the team size or corporate support. See e.g. LuaJIT which was implemented by a single person in a shorter time frame and achieves performance similar to Node.js.

> Also saying the effort of PyPy is purely around speed is misleading

Didn't say that. But unfortunately the other RPython-based implementations don't seem to be faster either.

link

Rochus 2071 days ago

So here some references in case you don't believe that there was EU funding:

https://doc.pypy.org/en/release-1.9/index-report.html

https://ieeexplore.ieee.org/document/1667583

https://mail.python.org/pipermail/pypy-dev/2004-December/001...

https://en.wikipedia.org/wiki/PyPy#Funding

link

the_mitsuhiko 2071 days ago

And PyPy breaks some Python code (eg: most C extensions are very slow) in the process. PyPy is a different dialect of Python.

link

overgard 2071 days ago

Slow != Breaks. I've run plenty of production python code in pypy. I'm sure it's not appropriate everywhere, but I wouldn't go so far as to call it a separate dialect.

link

Rochus 2071 days ago

CPython could implement an alternative, more efficient FFI (such as e.g. the one by LuaJIT) which would not slow down PyPy. So people could gradually migrate.

link

sk2020 2071 days ago

Being 13x faster is not a compelling enough reason to use Node, in my opinion.

link

the_mitsuhiko 2071 days ago

> PyPy is still significantly faster than CPython while (afaik) allowing that sort of stuff to go on

First of all that's only true when it managed to jit the code, secondly only until you try to do any of those slow things. For instance the C ABI emulation they have both cannot support all of CPython and wrecks performance. The same is true if you try to do fancy things with sys._getframe which a lot of code does in the wild (eg: all of logging).
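For readers who have not used it, a minimal sketch of the kind of frame introspection meant here. The helper names `caller_info` and `handle_request` are hypothetical; only `sys._getframe` and the frame attributes are real CPython features.

```python
import sys

def caller_info():
    # sys._getframe(1) returns the frame object of the function that
    # called us. It is documented as a CPython implementation detail.
    frame = sys._getframe(1)
    # The frame exposes the caller's code object and its local variables.
    return frame.f_code.co_name, dict(frame.f_locals)

def handle_request():
    user = "alice"
    return caller_info()

name, caller_locals = handle_request()
print(name, caller_locals)
```

Because any function may do this to its caller at any time, an optimising implementation has to be able to materialise a full frame object on demand, even for code it has compiled.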

In addition PyPy has to do a lot of special casing for all the crazy things CPython does. I recommend looking into the amount of engineering that went into it.

link

overgard 2071 days ago

Yeah, but most code doesn't use things like sys._getframe? I don't see the problem here. You can choose whether those features are worth the speed penalty or not.

And yeah the C ABI is slow, but that's true of practically every language. Again, it's a choice of if you use those things or not. That doesn't devalue making other parts of the language faster.

link

kissgyorgy 2071 days ago

PyPy is faster at the price of higher memory usage, which is not always desirable.

link

ilyagr 2071 days ago

I found the following talks by Armin Ronacher very informative on these topics (C API, why Python is more difficult to speed up than JS).

https://youtu.be/qCGofLIzX6g https://youtu.be/IeSu_odkI5I

I wish these were the things Python 3 addressed, rather than Unicode. I guess it's much more obvious in hindsight than back when Python 3 was designed.

link

mumblemumble 2071 days ago

I would guess that, if Python 3 hadn't addressed Unicode, Python would never have come to a place where so many people are worried about its performance.

Python's still a great language for the things it was being designed for back in the 2000s. But adding decent Unicode support is a big part of what helped it become an attractive language for use cases where I wish it performed better or had better support for parallelism. Natural language processing, for example.

link

kgwgk 2071 days ago

Any other example? Because there are lots of high-performance computing tasks where unicode couldn't matter less.

link

kec 2071 days ago

How do you square that assertion with the fact that people clung so hard to python 2 that it took the PSF 12 years to finally kill it?

link

mumblemumble 2071 days ago

Some people clung hard to 2. Others flocked to 3.

In my direct experience, the only people who waited until the bitter end (and beyond) were ops folks who never had to stray much outside of 7-bit ASCII, and companies with large existing codebases that didn't want to allocate the resources to migrating. Neither of those really have much to do with my assertion that Python 3 attracted new people doing new things.

link

arc776 2071 days ago

Just want to say thanks for these links, very interesting so far.

A point made in the video that seems to highlight the issue:

> Just adding two numbers requires 400 lines of code.

In compiled languages, this is one instruction! Think about the cache thrashing and memory loading involved in this one operation too. How can this possibly be fixed?

Python is a great language, but I don't know if it can ever be high performance on its own.

link

eesmith 2071 days ago

Which compiled language adds 680564733841876926926749214863536422912 and 35370553733215749514562618584237555997034634776827523327290883 in one instruction?

FWIW, here's the relevant dispatch code in Python's ceval.c where you see it uses a very generic dispatching at that level, which eventually, deeper down, gets down to the "oh, it's an integer!"

        case TARGET(BINARY_ADD): {
            PyObject *right = POP();
            PyObject *left = TOP();
            PyObject *sum;
            /* NOTE(haypo): Please don't try to micro-optimize int+int on
               CPython using bytecode, it is simply worthless.
               See http://bugs.python.org/issue21955 and
               http://bugs.python.org/issue10044 for the discussion. In short,
               no patch shown any impact on a realistic benchmark, only a minor
               speedup on microbenchmarks. */
            if (PyUnicode_CheckExact(left) &&
                     PyUnicode_CheckExact(right)) {
                sum = unicode_concatenate(tstate, left, right, f, next_instr);
                /* unicode_concatenate consumed the ref to left */
            }
            else {
                sum = PyNumber_Add(left, right);
                Py_DECREF(left);
            }
            Py_DECREF(right);
            SET_TOP(sum);
            if (sum == NULL)
                goto error;
            DISPATCH();
        }

Python code can be made more high performance if there's some way to tell the implementation the types, either explicitly or by inference or tracing. That's how several of those listed projects get their performance.
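A small sketch of why the dispatch has to be that generic: the same source-level `+` ends up doing very different work depending on the runtime types. The `Meters` class and the `add` helper are hypothetical names used only for this illustration.

```python
class Meters:
    """Hypothetical user-defined type that overloads +."""

    def __init__(self, value):
        self.value = value

    def __add__(self, other):
        return Meters(self.value + other.value)


def add(a, b):
    # One operation in the source; the work done depends on the types
    # that show up at run time.
    return a + b


results = [
    add(2**129, 10**61),              # arbitrary-precision integers
    add(5j, 3),                       # complex + int
    add("py", "thon"),                # string concatenation
    add(Meters(1), Meters(2)).value,  # user-defined __add__
]
print(results)
```

An implementation that knows, or has observed, that `add` is only ever called with small integers can skip all of this, which is the point about explicit types, inference and tracing.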

link

arc776 2071 days ago

Of course bigints require more than one instruction to add them, but even then you can reduce the work at compile time down to a series of integer operations, whereas the above code requires interpreting the program before it even gets to the add.

In your example text processing in `unicode_concatenate` is going to be very, very much slower than a bulk load of the native numerical data directly from memory and processing it. For each character, Python needs to check a number is still a number at run time then convert the result to a native numeric. I can only assume this string processing is at worst performed once and cached(?), because otherwise it doesn't seem like it would run well at all and surely Python's bigint performance is pretty important.

> Python code can be made more high performance if there's some way to tell the implementation the types, either explicitly or by inference or tracing.

At that stage, I would just use Nim and get better performance and a decent static type system included and either call it from Python, or call Python from Nim.

link

eesmith 2071 days ago

You did write "two numbers" ;)

Guess I could also have used 5j + 3 as a counter-example.

If this is an issue then at this stage, many Python people switch to use one of the alternatives mentioned here, like Cython, which is a Python-like language which includes a static type system (including support for C++ templates) and can easily generate C extensions that can call and be called from Python.

link

the_mitsuhiko 2066 days ago

Note that the version of BINARY_ADD you're looking at is newer than what the talk referenced. The "fast path" for integer addition was removed which the talk still talked about. You can see a discussion about that linked in the comment of the code you pasted.

link

baq 2071 days ago

unicode absolutely had to be done. it'd be even more insane to leave strings as they were. maybe if you never venture outside of 7 bits it's only pain with negative ROI, but trust me the world has more languages than english and first-class support for unicode strings as just strings is a must. it was a painful transition but a necessary one. all other modern languages simply started there (and they're old enough to have a beer, too).

link

carabiner 2071 days ago

(OP is Armin)

link

rmrfstar 2071 days ago

Python is the duct tape of programming languages.

Never the best tool if you have strict performance requirements, but so damn versatile it will never go away.

Cython does need better docs though, the steep learning curve means it is under-utilized.

link

johnisgood 2071 days ago

> Python is the duct tape of programming languages.

For some that glue is Forth. :D

> A guy named Jean-Paul Wippler is considering using Forth as a super glue language to bind Python Perl and Tcl together in a project called Minotaur (http://www.equi4.com/minotaur/minotaur.html).

> Forth is an ideal intermediary language, precisely because it's so agile. Otherwise, it wouldn't have been chosen for OpenFirmware, which when you think about it, is a Forth system that must interface to a potentially wide variety of programming language environments.

link

bydo 2071 days ago

We said the same thing about Perl for a couple decades.

link

rmrfstar 2071 days ago

I feel like Python's readability and interoperability with C will give it more staying power.

Is this wishful thinking?

link

user5994461 2071 days ago

Python is everywhere. The amount of python code that is written every day is staggering. It will be there 30 years from now doing the same thing and people will be looking at that code like it's COBOL.

Perl is extinct in comparison. It's not been used for any projects anywhere for a long long time.

link

jnxx 2071 days ago

> Perl is extinct in comparison.

Which is a good example that the decrease in use can go a lot faster than you think. Perl was widely used in 2000, and thought to be on par with Python. Similarly Visual Basic which nobody seems to remember any more.

Also, COBOL is simply used because it is uneconomical to rewrite those old programs, not because it is a good language to write new stuff in. But the heavy dependency of Python programs on libraries hosted across the web means that obsolescence can happen a lot faster today; a COBOL program is almost totally self-contained in comparison.

link

ryl00 2071 days ago

Plenty of us still do.

link

hombre_fatal 2071 days ago

As a former Perl user, I think it's time we finally derank "plenty" into just "handfuls".

link

Foober223 2071 days ago

Perl is back!

http://perlcommunity.org/

link

edsac_xyzw 2068 days ago

This duct tape is the Python native extension API (not Python ctypes), which allows creating native code modules (aka libraries) in C or C++ and creating wrappers to existing C or C++ libraries. This escape hatch enables offloading CPU-intensive computations to high-performance libraries written in C, C++ or Fortran. Another benefit of Python modules written in C or C++ is that they can release the GIL (Global Interpreter Lock) while running native code, so they can take advantage of multiple cores and SIMD instructions and achieve higher performance.

link

centimeter 2071 days ago

And like things made out of duct tape, I’ve never found anything made using python that actually functioned well.

link

NortySpock 2071 days ago

Did it function well enough for your purposes?

Because I don't need it to be hermetically-sealed perfection, I just need my python code to spit out a good result when I throw it at a problem; nevermind that it took a few seconds to spin up or needs more memory than a perfectly crafted C program.

link

centimeter 2071 days ago

Most python apps don’t function well enough for my purposes.

link

coldtea 2071 days ago

Then you have bizarro purposes, because Python is in production all over the world, in business-critical systems with billions at stake...

link

entropicdrifter 2071 days ago

It might never be the best tool for the job, but it has certainly saved lives before: https://confessionsoftheprofessions.com/interesting-facts-of...

link

gameswithgo 2071 days ago

AlphaZero?

link

Liquid_Fire 2071 days ago

You could say the same about JavaScript, but with very heavy investment there are now several implementations that have improved its performance significantly.

Also see PyPy, which manages to squeeze a lot more performance out of Python for many use cases without changing the language.

link

ihnorton 2071 days ago

The principal developer of Pyston commented on the JavaScript comparison recently [1]:

> This is a common view but I've never heard it from someone who has tried to optimize Python. Personally I think that Python is as much more dynamic than JavaScript as JavaScript is than C.

[1] https://news.ycombinator.com/item?id=23247618

link

awestroke 2071 days ago

JS does not have mutable interpreter frames, global interpreter locks, shared global state, type slots, the C ABI.

link

dfox 2071 days ago

JS and Python have essentially the same data model, with everything being at least conceptually built out of dicts.

And well, most JS implementations do not have GIL because they are not multithreaded at all.

link

baq 2071 days ago

true and doesn't matter in the context. you can't change (or inspect) the stack frame as an object. you can kind of look at it with Error().stack. this allows the JS JIT to make assumptions that a python compiler simply cannot.

link

dfox 2071 days ago

Most of the things that real application code (i.e. not something like a debugger) can accomplish by modifying or even inspecting frame objects are going to depend on various things that are documented as CPython implementation details. Also, a JS runtime that supports some kind of debugger interface has to solve the same class of problems. And the most straightforward solution is not even that complex: you simply have to track when some kind of assumption gets broken and then fall back to interpretation or recompile the relevant code (the most complex part of that probably is converting the native stack frame back into an interpreter stack frame, which you have to be able to do anyway in order to even expose it to user code for it to be able to modify it).
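The "track the assumption, fall back when it breaks" idea can be sketched as a toy in pure Python. This is only an illustration of the guard pattern, not how PyPy, V8 or any real JIT is implemented, and `make_guarded_add` is a hypothetical name.

```python
def add_generic(a, b):
    # Slow path: full dynamic dispatch, always correct.
    return a + b


def make_guarded_add():
    stats = {"fast": 0, "fallback": 0}

    def add(a, b):
        # Guard: the specialised path is only valid for exact ints.
        if type(a) is int and type(b) is int:
            stats["fast"] += 1
            return int.__add__(a, b)  # stand-in for specialised code
        # Assumption broken: fall back to the generic path.
        stats["fallback"] += 1
        return add_generic(a, b)

    return add, stats


add, stats = make_guarded_add()
print(add(1, 2), add("a", "b"), stats)
```

A real implementation would emit machine code for the fast path and rebuild interpreter state when a guard fails; the sketch only shows the control flow.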

In fact I think that there are many relatively simple modifications that would make CPython significantly faster, but many such things conflict with each other in ways that make the resulting complexity not worth it.

link

arc776 2071 days ago

The thing about Javascript is it's actually a very simple language. You can make a lot of guarantees and this means performance patterns can be implied.

Ultimately JS can be reduced to a very tight engine. This is not possible with Python, it's just too dynamic.

link

arc776 2071 days ago

I completely agree. Everything is baked in to be slow. There is no way around it, I don't think you can write super fast interpreters like with Javascript - I might be wrong, but so far it hasn't happened.

For general use cases the performance is fine, but only thanks to the hard work of C/CPython/Cython programmers who give up Python's rich expressibility to gain this performance. It seems like you simply have to use another language to get anything running fast.

Having said all that, Pyc seems interesting as it apparently compiles Python. Has anyone had any experience of this?

link

chrisseaton 2071 days ago

> There is no way around it

What aspects of the language are you convinced cannot be optimised? There's tons of research in this area.

link

arc776 2071 days ago

As the OP states:

> mutable interpreter frames, global interpreter locks, shared global state, type slots

On top of this, Python is extremely dynamic and nothing can be assured without running through the code. So this leads to needing JITs to improve performance which then give a slow start up time and increased complexity. Even with JIT, Python is just not fast thanks to the above issues and its overall dynamism.

It can be optimised and for sure there's some impressive attempts at doing so. However I don't think pure Python will ever be considered "fast" as these things necessarily get in the way.

I highly recommend the two videos posted here that go into more detail as to why there are limits to how far optimisation can go: https://youtu.be/qCGofLIzX6g https://youtu.be/IeSu_odkI5I

link

chrisseaton 2071 days ago

> why there are limits to how far optimisation can go

I'd challenge the idea that there really are known 'limits'. As I say there's research towards this, these videos are old, and Armin and Seth may not be up to date with all of the literature (in fact I'm sure Seth is not, as he's missing at least one major current Python implementation research project from his blog post.)

link

arc776 2071 days ago

> I'd challenge the idea that there really are known 'limits'.

There are good reasons why these limits cannot be overcome in that the complexity and dynamism of the language precludes it.

Being interpreted is one cost that sets a significant barrier to performance, and the dynamic complexity further compounds it. For example whereas JS is basically only functions, in Python you have a huge range of ways you can do incredibly complex things with slot wrappers, descriptors, and metaprogramming.
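As one concrete instance of that dynamism, a descriptor lets a plain-looking attribute read run arbitrary code. The `Logged` and `Point` classes below are hypothetical and exist only for this sketch.

```python
class Logged:
    """Hypothetical data descriptor: runs code on every attribute read."""

    def __set_name__(self, owner, name):
        self.name = "_" + name

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        obj.reads += 1  # side effect hidden behind `obj.x`
        return getattr(obj, self.name)

    def __set__(self, obj, value):
        setattr(obj, self.name, value)


class Point:
    x = Logged()

    def __init__(self, x):
        self.reads = 0
        self.x = x


p = Point(3)
total = p.x + p.x
print(total, p.reads)
```

An implementation cannot treat `p.x` as a simple field load unless it has established that no descriptor like this is involved.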

Ultimately, Python will get faster, but diminishing returns are inevitable. Python can never be as fast as the equivalent code in a compiled language. It simply has too much extra work to do.

link

MR4D 2071 days ago

C has a slow startup time as well. We just call that compiling.

Having the option to be slow startup/fast execution is a good option to have. Maybe not for some, but definitely needed by others.

link

jashmatthews 2071 days ago

Chris already proved we can do exactly that for Ruby with TruffleRuby. I don’t think there’s any reason GraalPython couldn’t do the same given more work?

link

arc776 2071 days ago

I'm not familiar with Ruby, but it doesn't seem like TruffleRuby is really competitive with languages known for performance.

These are only simple benchmarks, but do indicate a rough ballpark for TruffleRuby: https://github.com/kostya/benchmarks

As I understand it, Crystal would be a good Ruby alternative if you want performance. This is of course a whole new language designed with performance in mind from the beginning and here is a repeating theme: you need to consider performance at the start, not 20 years later.

link

kzrdude 2071 days ago

Could we seriously start looking at deprecating some of the features that make Python slow? Who needs mutable interpreter frames?

link

chrisseaton 2071 days ago

Why not make mutable interpreter frames fast instead?

link

mikepurvis 2071 days ago

It's true that there are certain non-negotiable costs there, and projects like Mercurial have invested heavily in trying to figure out how to make Python start up faster, and basically hit a brick wall (see: https://www.mercurial-scm.org/wiki/PerformancePlan).

That said, for a lot of other projects which haven't yet looked, there may be some low-hanging fruit. For example, I was doing some looking at this recently on a highly pluggable workspace build tool called colcon [1], and found that of 5+ seconds of startup time, I could save about 1 second with "business logic" changes (adding caching to a recursive operation), another 1 second by switching some filesystem operations to use multiprocessing, and about 1.5 seconds from making some big imports (requests, httpx, sanic) happen lazily on first use.
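The lazy-import change can be sketched roughly like this. The standard-library `json` module stands in for a heavy dependency such as requests or httpx, and `export_report` is a hypothetical function name.

```python
def export_report(data):
    # Deferred import: the import cost is paid on the first call to this
    # function rather than at program startup. Later calls hit the
    # sys.modules cache, so the statement is cheap after the first time.
    import json  # stand-in for a heavy third-party dependency

    return json.dumps(data, sort_keys=True)


print(export_report({"b": 1, "a": 2}))
```

This helps most for command-line tools where many invocations never reach the code path that needs the dependency.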

[1]: https://github.com/colcon/colcon-core/issues/398

link

jnxx 2071 days ago

That's really surprising if one considers for a moment how many things Python has in common with Common Lisp, a language which can be compiled to run near C speed (albeit with some sacrifices on safety i.e. "unsafe" optimizations). And if anything, Python 3 has become more similar to Lisp, while running at 1 / 20 of its speed.

link

beagle3 2071 days ago

Python does have a lot in common with CL; but the problem with Python is that almost any call you cannot statically inline, which is most of them, can change the semantics of everything else - you've just called math.floor() ; are you sure it wasn't just monkeypatched to assign 7 to all local variables that have an 'x' in their name in the caller's frame?
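A tame version of that scenario, as a sketch: replacing `math.floor` at run time changes what an unrelated function does. The `bucket` helper is hypothetical, and the patch is undone at the end so the example leaves no side effects.

```python
import math


def bucket(value):
    # Looks like a pure numeric helper, but `math.floor` is looked up on
    # the module object every time this function runs.
    return math.floor(value)


before = bucket(2.7)

original_floor = math.floor
math.floor = lambda x: "surprise"  # any code, anywhere, may do this
try:
    during = bucket(2.7)
finally:
    math.floor = original_floor    # restore the real function

after = bucket(2.7)
print(before, during, after)
```

A compiler that wants to inline `math.floor` into `bucket` therefore needs a way to notice the rebinding and discard the inlined version.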

Most of these uses are very rare, but the tail is incredibly long for Python, and the problem is that you can't even compile a "likely normal" and a "here be dragons" versions, and switch only when needed - you need to constantly verify. The same is not true, AFAIK, with Common Lisp - being a lisp1 and having a stronger lexical scope than python does.

Shedskin is a Python to C++ compiler that mostly requires the commonly-honoured constraint that a variable is only assigned a single type throughout its lifetime. (And that you don't modify classes after creation, and that you don't need integers longer than machine precision, and ....); While many programs seem to satisfy these requirements on superficial inspection, it turns out that almost all programs violate them in some way (directly or through a library).
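As a rough illustration of how easily ordinary code breaks those constraints (taking the constraints as described in this comment, not as checked against Shedskin's documentation), consider this hypothetical snippet:

```python
def parse(token):
    value = token           # starts life as a str
    if token.isdigit():
        value = int(token)  # same name, now an int
    return value


# Perfectly normal Python, but larger than a 64-bit machine integer.
big = 2**64 + 1

print(parse("42"), parse("abc"), big)
```

Both patterns are valid and common in Python, and neither fits a model where every variable has one static type and integers have a fixed width.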

The probability that Shedskin will manage to compile a program that was not written with Shedskin in mind is almost zero.

Nuitka was started with the idea that, unlike shedskin, it will start by compiling the bytecode to an interpreter-equivalent execution (which it does, quite well), to get a minor speed up - and then gradually compile "provably simple enough" things to fast C++; a decade or so later, that's not working out as well as hoped, AFAIK because everything depends on something that violates simplicity.

link

nemoniac 2071 days ago

> The same is not true, AFAIK, with Common Lisp - being a lisp1 and having a stronger lexical scope than python does.

Common Lisp is a Lisp2.

link

beagle3 2071 days ago

Thanks for the correction, indeed, I meant lisp2 even though I typed lisp1;

It makes it easier to compile than a lisp1 (to which Python is closer), because the standard call form s-expression can be bound early.

link

chrisseaton 2071 days ago

> That’s mutable interpreter frames, global interpreter locks, shared global state, type slots, the C ABI.

There's research towards solving all of these problems.

> The only way to speed it up would be to change the language.

Maybe we just haven't worked out how yet? Nothing you've mentioned is known to be impossible to make fast.

link

snicker7 2071 days ago

Python is ~100x slower than C. There is definitely wiggle room for improvement.

link

Rochus 2071 days ago

How would you explain then that LuaJIT is so much faster than CPython? Even the interpreter of LuaJIT is much faster.

> The only way to speed it up would be to change the language.

What specifically? Most of your points are not related to the language. And even current Smalltalk engines are much faster than CPython (see https://github.com/OpenSmalltalk/opensmalltalk-vm).

link

jashmatthews 2071 days ago

Lua doesn’t have assignment as an expression. Lua 5.1 has float as the only numeric type. Lua varargs are easier to implement.

Each VM op for Python or Ruby ends up being bigger and having more branches. For Ruby this is quite painful on the numeric types. Branching, boxing and unboxing is far slower than just testing and adding floats in the LuaJIT VM.

Due to assignment as an expression and things like x = foo(x, x+=1), Ruby, Python and JS all need to copy x into a new VM register when it’s used. LuaJIT can assume locals aren’t reassigned mid-statement and doesn’t need copies.
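Note that `x+=1` is a statement in Python and cannot appear inside a call, so the literal example is not valid Python. The closest equivalent uses the assignment expression operator `:=` (Python 3.8 and later), as in this sketch with a hypothetical `foo`:

```python
def foo(a, b):
    return a, b


x = 1
# Arguments are evaluated left to right: the first argument captures the
# old value of x, then the assignment expression rebinds x before the
# second argument is passed.
result = foo(x, (x := x + 1))
print(result, x)
```

The first argument has to be preserved somewhere before `x` is rebound, which is the copy being described.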

link

Rochus 2071 days ago

> Lua doesn’t have assignment as an expression.

That's quite easy to achieve if you directly generate bytecode. See e.g. https://github.com/rochus-keller/som.

> Lua 5.1 has float as the only numeric type

It internally distinguishes between int and float.

> Ruby, Python and JS all need to copy x into a new VM Register when it’s used

Even the OpenSmalltalk VM is much faster than CPython, as well as V8.

link

fanf2 2071 days ago

Lua exposes much less of its internals than Python. For example the comment you replied to mentioned stack frames which are not exposed in Lua.

link

anonymoushn 2071 days ago

Those are exposed via the built-in debug library, including in luajit.

link

fanf2 2071 days ago

Oh whoops yes :-)

Note that you can only look up variables by their bytecode register number, not by name.

link

jashmatthews 2071 days ago

IIRC that uses the Lua C API which LuaJIT supports by fully restoring the interpreter state?

link

stephc_int13 2071 days ago

JavaScript and PHP were not designed to be fast either.

Oh, wait...

link

jerf 2071 days ago

Most languages designed in that era were not designed to be fast, and none of them were designed to be fast on 2020-era processors. The former is because this was the era of exponential CPU growth, and the latter because as good as many of these language designers were, none of them were psychic.

I'm pretty sure both Guido for Python and Larry for Perl were explicitly aware of the impossibility of designing for processors that wouldn't exist for 20 years, though digging up quotes to that effect would be quite difficult.

A mantra of that era is "There are no slow languages, only slow implementations." I, for one, consider this mantra to be effectively refuted. Even if there is a hypothetical Python interpreter/compiler/runtime/whatever that can run effectively as fast as C with no significant overhead (excepting perhaps some reasonable increase in compile time), there is no longer any reason to believe that mere mortal humans are capable of producing it, after all the effort that has been poured into trying, as documented by the original link. Whatever may be true for God or superhuman AIs, for human beings, there are slow languages that build intrinsically slow operations into their base semantics.

link

robertlagrant 2071 days ago

Well, Javascript has similar dynamism and yet v8 exists. I'm not saying the investment will ever happen for Python, but I do think it's possible for humans.

link

jerf 2071 days ago

v8 is not C-fast on general code.

A lot of people seem to have the mistaken impression that v8 makes Javascript "fast". It's "fast" for a dynamic language. But on general code... it's still slow. It seems to plateau around 10x slower than C, as with the other JIT efforts to speed up dynamic languages, with a roughly 5-10x memory penalty in the process.

Microbenchmarks like the benchmark game tend to miss this because a lot of microbenchmarks focus on numeric speed. But numeric code is easy mode for a JIT. Now, that's cool, and there's nothing wrong with that. If it's the sort of code you have, great! You win. But that performance doesn't translate to general code. These are not value judgments, these are just facts about the implementation.

I expect v8 is roughly as fast as JS is going to get, and it's now news if they can eke out a .5% improvement on general code.

You can also do much better with v8 if you program in a highly restricted subset of JS that it happens to be able to JIT very well. However, this is not really the same as writing in JS. It's an undocumented subset, it's a constantly changing subset, and there's not a lot of compiler support for it (I'm not aware of anything like a "use JITableOnly" or anything).

link

robertlagrant 2061 days ago

I wasn't claiming it was C-fast. I was replying to the parent comment on "humanly possible". Apologies if that wasn't clear.

link

formerly_proven 2071 days ago

The problem is not jitting pure Python code. PyPy does that and it's quite good at doing it. The problem is exactly what mitsuhiko says, and lots of Python apps use some or all of these features implicitly (through the native extensions used by many common dependencies). Sure, some of this madness can be accessed from Python itself, but that's also the case for JS, where you can do certain things, which will slow down your code greatly because the JIT can't work with it.

link

skohan 2071 days ago

> C ABI

Why should this make python slow?

link

chrisseaton 2071 days ago

If you have to meet an existing ABI then you're constrained in how you can optimise.

link

moralsupply 2071 days ago

That's not correct. Python will never be as fast as hand-optimized assembler, but it certainly can be much (5-10x) faster than what it is right now for most workloads. Pypy is a living proof that it can be done.

link

dralley 2071 days ago

You're arguing with mitsuhiko, he's given entire talks on this subject.

https://www.youtube.com/watch?v=qCGofLIzX6g&t=31m44s

PyPy is faster for pure Python code, but that comes at the expense of having a far slower interface with C code. There's an entire ecosystem built around the fact that while Python itself is slow, it can very easily interface with native code (Numpy, Scipy, OpenCV) with very little overhead.

So sure, you can make Python much faster, if you're willing to piss off the very Python users who care the most about performance in the first place (the data science / ML people and anyone else using native extensions).

link

bastawhiz 2071 days ago

Ultimately, at least IMO, no attempt to speed up Python will succeed until the issue of Python's C API is addressed. This is arguably PyPy's only major barrier: if you can't run the software on it, you're not going to use it. Pyston was arguably the most serious attempt at fast Python while maintaining compatibility with the API, but DBX clearly didn't see the ROI they were hoping for.

It's looking like HPy is going to (hopefully) solve this. But finishing HPy and getting it adopted is likely to be a pretty massive undertaking.

link

travisoliphant 2071 days ago

I think this is true. I have used the Python C-API heavily, having started SciPy, NumPy and Numba. I have a pre-alpha plan for addressing the C-API by introducing EPython (a typed subset of Python for extending it). It is not usable yet and is at the idea stage only, but I welcome collaborators and funders: https://github.com/epython-dev/epython. Here is a talk that describes the vision a bit more: https://morioh.com/p/6db365736476

link

seg_lol 2071 days ago

Interesting, I assume you are familiar with the Terra [1], Titan [2] and Pallene [3] research languages?

I love the idea of a typed base language used to implement a higher-level, more flexible language while still being able to drop down for correctness and speed. Gradually dynamically typed, ;)

Another thing to look at is https://chocopy.org/, a typed subset of Python for teaching compilers courses. It might be worthwhile pinging ChocoPy students and enticing them towards EPython.

What is the semantic union and intersection between EPython and Chocopy?

[1] http://terralang.org/

[2] https://github.com/titan-lang/titan

[3] https://github.com/pallene-lang/pallene

link

Rotareti 2071 days ago

This looks interesting!

I think the approach where a typed subset of Python is used to compile a fast extension module is the way forward for Python. This would leave us with a slow but dynamic high-level-variant (CPython) and typed lower-level-variant (EPython, mypyc & co) to compile performant extension modules, which you can easily import into your CPython code.

The most prominent of such projects I know of is mypyc [0], which is already used to improve performance for mypy itself and the black [1] code formatter. I think it would be interesting to see how EPython compares to mypyc.

[0] https://github.com/python/mypy/tree/master/mypyc

[1] https://github.com/psf/black/pull/1009

link

sitkack 2071 days ago

If CPython had slowly deprecated the C API in favor of an external extension mechanism like Lua using ctypes or cffi [1], then we wouldn't be in this mess. Best time to plant a tree.

The C API is what prevents PyPy or other Python runtimes from being able to compete and interop. The community could do this, rebase Python modules with native code to cffi so that they can run in all Pythons. The C API is neither good, nor necessary and only serves to gate keep CPython's access to the rest of the Python user community.

[1] https://www.pypy.org/compat.html

link

kuu 2071 days ago

Yeah, the current CPython code base is just too big to pretend that anything else without compatibility will be highly adopted.

It's a bit like python3 from python2, it's been so slow and painful to transition because you cannot just "drop all your code" (I'm simplifying the issue).

link

intrepidhero 2071 days ago

What I really want for Python is a knob to improve startup time. I've imagined there must be a way to "statically link" dependencies so that import isn't searching the disk but just loading from a fixed location/file. There don't seem to be many resources on the net. I've found this one: https://pythondev.readthedocs.io/startup_time.html. I tried using virtualenvs to limit my searchable import paths, and messed around with Cython in an effort to come up with a statically linked binary. But I've yet to come up with anything that really improves the startup time. Clearly I have no idea what I'm doing.

link

jakear 2071 days ago

I once got quite a bit of startup time improvement by simply swapping out cpython's malloc calls for a version that took a large amount of resources at first (~5GB, can be tuned to your workload), and allocated from that. CPython makes many many thousands of mallocs at startup so this gave significant improvement.

link

intrepidhero 2071 days ago

This is a really interesting approach. I'd love to hear more. Were you patching cpython then? Is your work online somewhere?

link

jakear 2070 days ago

See sibling. But yes I patched cpython, it’s pretty easy to build yourself.

link

kirubakaran 2071 days ago

Can you please share some numbers if you have them? How much improvement, etc.

link

jakear 2070 days ago

60% decrease in startup time, 10% improvement in general runtime. Of course if you exceed the initial allocation the whole thing will crash. That could probably be worked around with a bit more intelligence in the allocator.

This was part of a class project so not available online unfortunately. It’s good practice to implement it yourself though! There are lots of resources online for implementing fast allocators.

link

bob1029 2071 days ago

>Clearly I have no idea what I'm doing.

Doesn't look like that from over here.

Many times the difference between failure and the magic spell working is one more late-night iteration. In this specific case you are working against some difficult constraints that are deep in the language. That said, there is almost always a way to side-step a problem altogether. You may find that one workaround is to amortize the startup concern over time, i.e. reorient the problem domain so you only have to start the Python process once a day. Or, find a way to defer loading of required components until the runtime actually needs them.

link

necovek 2071 days ago

It is trivial in Python to move towards a lazier way to load modules on first use, it's just not idiomatic or too readable (thus we do top-level imports).
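For illustration, a minimal sketch of the deferred-import pattern described here. `parse_config` is a hypothetical function, and `json` merely stands in for a heavy dependency:

```python
def parse_config(text):
    """Parse a JSON config string, importing the parser only when needed."""
    # Deferred import: the cost is paid on the first call instead of at
    # program startup. Later calls hit the sys.modules cache.
    import json

    return json.loads(text)
```

The standard library also ships `importlib.util.LazyLoader`, which postpones executing a module until one of its attributes is first accessed.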

link

jhayward 2071 days ago

If you want to improve startup for a script that you use frequently consider using one of the app builders such as PyOxidizer[1]. They do work to improve startup by embedding all the modules in the binary and then loading them from memory.

[1] https://pyoxidizer.readthedocs.io/en/latest/index.html

link

necovek 2071 days ago

What I found is that Python is not that terrible (interpreter starts in 60ms on my laptop and imports an empty local file: `touch empty.py; time python3 -c 'import empty'`).

However, idiomatic Python shortcuts to expose everything at the top level (star imports or imports of everything in the top-level __init__.py) cause everything to be imported everywhere. __all__ is all but forgotten, so importing things like flask, sqlalchemy, requests and similar will take anywhere from 100-500ms each, even if you just need a single function from a submodule.

Worst offenders are things which embed their own copy of requests (likely for reproducible builds) taking upwards of 800ms just to import even if your project already imported requests directly.

I don't think it has anything to do with search paths, but simply with loading and executing hundreds of files. If you need those modules, Python will read them. Perhaps moving your venv to a "ramdisk" might help?
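A rough sketch of how such per-import costs could be measured from inside Python. `time_import` is a hypothetical helper, and the result is only meaningful in a fresh interpreter where the module has not been imported yet:

```python
import importlib
import time


def time_import(name):
    """Return the seconds spent importing `name`.

    The result is near zero if the module is already cached in sys.modules.
    """
    start = time.perf_counter()
    importlib.import_module(name)
    return time.perf_counter() - start
```

For a per-module breakdown, `python3 -X importtime -c 'import requests'` (Python 3.7+) prints the time spent in every module that gets pulled in.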

link

korijn 2071 days ago

Have you established that searching for modules is slow? I think it just takes time to actually process the imported modules and load everything into memory.

link

formerly_proven 2071 days ago

On Windows (what with atrocious NTFS performance and all) an interpreter that's using a zipped library is way faster than one using loose modules.
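A minimal sketch of importing from a zip archive, the mechanism a zipped library relies on. `import_from_zip` and the module name used with it are hypothetical, and this shows only the mechanism, not the speed difference:

```python
import importlib
import os
import sys
import tempfile
import zipfile


def import_from_zip(module_name, source):
    """Write `source` into a fresh zip archive and import it from there."""
    archive = os.path.join(tempfile.mkdtemp(), "bundle.zip")
    with zipfile.ZipFile(archive, "w") as zf:
        zf.writestr(f"{module_name}.py", source)
    # Zip archives are valid sys.path entries; the built-in zipimport
    # machinery reads modules straight out of the archive.
    sys.path.insert(0, archive)
    return importlib.import_module(module_name)
```

One archive means a single file to open instead of hundreds of loose ones, which is where the filesystem overhead would be saved.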

link

intrepidhero 2071 days ago

No. In fact my experiments suggest otherwise. That was just where my intuition led me.

link

Rotareti 2070 days ago

You can compile your Python code using Nuitka, the resulting binary has much better startup time. I do this for a couple of command line tools.

link

oweiler 2071 days ago

Couldn't a Python implementation in Truffle (GraalVM) solve the startup problem?

link

chrisseaton 2071 days ago

That's GraalPython.

link

mixmastamyk 2071 days ago

    python -s [-S]

link

joncatanio 2071 days ago

Not trying to self-promote, but this might be of interest to you. It's not a fully fleshed-out implementation, but my project analyzed specific language features that affect performance: https://github.com/joncatanio/cannoli

link

hydroxideOH- 2071 days ago

> Leave the features: Take the cannoli

Now that's how you title a thesis paper.

link

ramraj07 2071 days ago

Can't stop laughing at the most germane name that project could ever have.

link

joncatanio 2071 days ago

Needed something to chuckle at during my work ha!

link

sethgecko 2071 days ago

Yury Selivanov tweeted yesterday that Python 3.10 will be "up to 10% faster": https://twitter.com/1st1/status/1318558048265404420

link

yxhuvud 2071 days ago

Wait, python doesn't have any method lookup caching before this? I would have expected that developers looked at what other similar languages are doing, but apparently not enough.

link

saeranv 2071 days ago

Am I correct that 3.10 comes after 3.9? How does that make sense, shouldn't it increase to 4.x? Is there an actual 3.1 (coming after 3.0) that this conflicts with?

link

eznzt 2071 days ago

Version numbers are not decimal numbers, they are read like the chapters of a book: 3.10 (chapter 3 section 10) comes after 3.9 (chapter 3 section 9)
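A small sketch of that comparison rule: split on the dot and compare the parts as integers. `version_key` is a hypothetical helper that only handles purely numeric versions:

```python
def version_key(version):
    """Turn "3.10" into (3, 10) so each component compares as an integer."""
    return tuple(int(part) for part in version.split("."))


# Tuples compare element by element, so (3, 10) sorts after (3, 9).
releases = sorted(["3.10", "3.9", "3.1"], key=version_key)
```

Read as decimals the order would flip, since 3.10 equals 3.1 and is less than 3.9.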

link

theandrewbailey 2071 days ago

10 comes after 9, so 3.10 comes after 3.9. There are no major changes that would warrant going from 3.x to 4.0. It's just the 10th big release after 3.0.

Yes, there was a Python 3.1: https://www.python.org/download/releases/3.1/

link

stuaxo 2071 days ago

That's pretty good, in optimisation, 5% at a time is a good win.

link

centimeter 2071 days ago

That seems pretty small compared to the huge gap between python and basically any compiled language.

link

willseth 2071 days ago

The list should probably also include mypyc: https://github.com/python/mypy/tree/master/mypyc

link

Twirrim 2071 days ago

Another one missing from that list is Graalpython, https://github.com/graalvm/graalpython. It's in early stages of implementation, aimed at being python3 on top of GraalVM.

link

1wd 2072 days ago

One more: https://github.com/microsoft/Pyjion

link

forgotpwd16 2072 days ago

On the repo there's also a comparison[1] with some of the other implementations.

[1]: https://github.com/microsoft/Pyjion#how-do-this-compare-to-

link

sitkack 2071 days ago

I find it really interesting, that not only did they do the work of creating a JIT using the CoreCLR for CPython, they created a JIT API so that their system is augmenting CPython and not taking it over. Solid engineering.

This also means that one could implement an alternative JIT using Rust or OCaml.

https://github.com/microsoft/Pyjion#what-are-the-goals-of-th...

link

Naac 2072 days ago

This article appears to be a list of Python interpreters.

Not all of these were designed for speed. For example, Jython was also intended for Java/Python interoperability.

Some of the interpreters on the list haven't seen updates in a while, or don't support Python 3.x.

link

nknealk 2072 days ago

Numba is actually a pretty interesting project. It allows you to JIT compile a single function with a decorator. Static typing required, and it plays nice with numpy. They’ve also got some interesting stuff going on that lets you interface with nvidia GPUs as well.

Highly recommend it for anyone doing scientific computing

link

sitkack 2072 days ago

I agree, Numba is awesome for lots of reasons. The biggest advantage for everyone, the Numba team as well as its users, is that it is opt-in and done with intent. The programmer is saying, "I am willing to constrain my code to get perf". And that you can do that inside an existing runtime is pretty damn cool.

I think for Python to get decent speedups the semantics for the code being optimized needs to be highly constrained.

Optimizing full in the wild Python code is a huge huge task. Optimizing for operations over constant type arrays is much much easier.

Yes this doesn't speed up the call or the allocation rate, but start with some easy stuff or nothing will improve.

link

michelpp 2071 days ago

Numba is indeed a great library for speeding up Python and also doing other useful things when interacting with external libraries.

For example, the "JIT compile a single function" feature is gold when you need to pass a function callback pointer into a C library. This is how pygraphblas compiles Python functions into semiring operators that are then passed to the library, which has no idea that the pointer is to a JIT-compiled Python function:

https://github.com/michelp/pygraphblas/blob/master/tests/tes...

link

weakfish 2071 days ago

Forgive my ignorance, I'm not super knowledgeable on the subject, but does this mean you just add decorators to existing functions with typing and it enhances the speed?

link

b5n 2071 days ago

Yes, but types not required.

  from numba import jit

  @jit
  def jitted_fn(x):
      return x * 2  # placeholder body; Numba infers types on the first call

https://numba.readthedocs.io/en/stable/user/jit.html

link

weakfish 2070 days ago

So, does that mean Numba performs some sort of type inference on Python?

link

acomjean 2071 days ago

I tend to use Python for batch jobs and things where its speed isn't that important to me. Am I alone in this?

When I reach for Python it's not for speed. It's because it's fairly easy to write and has some good libraries.

Either it's done in a few seconds, or I can wait a few hours as it runs as a background Slurm task.

I feel like there is a group that wants python to be the ideal language for all things, maybe because I'm not in love with the syntax, but I'm ok having multiple languages.

link

nemothekid 2071 days ago

Many people don't start with Python for speed. They are exactly like you - they write a script that is done in a few seconds. Then the data scales, and it takes a few minutes. Then you need it to be faster, and now you either have to rewrite the script or live with the slowness. It would be helpful if you didn't need to make this choice.

link

ufo 2072 days ago

IIRC Psyco was a precursor to PyPy. Armin Rigo was involved in both.

link

beervirus 2071 days ago

Psyco was great. Add two lines of code, suddenly everything is (at least a little, often a lot) faster.

link

xioxox 2071 days ago

It was great. It showed that it is actually possible to run fast Python from within the standard interpreter with excellent compatibility. The only downside I remember was the memory usage.

link

chrisseaton 2071 days ago

Why isn't everyone using it then?

link

beervirus 2071 days ago

Doesn’t support modern Python versions.

>12 March 2012

>Psyco is unmaintained and dead. Please look at PyPy for the state-of-the-art in JIT compilers for Python.

http://psyco.sourceforge.net/

link

arc776 2071 days ago

I gave up trying to make Python fast since to do so you give up what makes Python good and end up writing C/Cython. On top of this, distributing Python is just... gross, at least for my use cases.

Eventually I found Nim and never looked back. Python is simply not built for speed but for productivity. Nim is built for both from the start. It's certainly lacking the ecosystem of Python, but for my use cases that doesn't matter.

link

zanellia 2071 days ago

In my opinion there is some potential there. Especially by exploiting the increasing integration of typing-oriented features (i.e. type annotations) and the interest in using those to carry out static analysis (e.g. in mypy, but also Facebook's Pyre, Microsoft's Pyright and many others), it might be possible to speed up execution times a bit. This is especially true if we restrict attention to a subset of Python, as, e.g., within domain-specific languages. It might not make sense to entirely reverse-engineer a language that was designed to be duck-typed into a statically typed one. However, for some domain-specific applications I find performance-oriented static analysis an interesting tool.

To make it more concrete, here is an experimental DSL for embedded high-performance computing that uses static analysis and source-to-source (Python-to-C, actually) code transformation: https://github.com/zanellia/prometeo.

link

overgard 2071 days ago

I don't know much about the other ones, but I think you'd have to say PyPy has been a success. Although to be honest, I don't know why it would be better to modify CPython vs. just using PyPy -- the JIT speedup does come with some tradeoffs (memory usage, warmup times), so it seems better just to leave that decision up to the user?

link

thelazydogsback 2071 days ago

It amazes me that the stack-entwined implementation with the GIL remained the canonical version this whole time -- I would think that the Stackless version (or similar) would have been the default long-ago. This really should have made it worth it from a 2.x to 3.x version perspective, even if many people had to rewrite their extensions, and even if some monkey-patching were removed from the language in favor of more disciplined meta-programming.

link

beagle3 2071 days ago

Stackless still uses the GIL; But it avoids using the C stack most of the time, which opens the door to green threads (of which you can have a lot more than OS threads), suspending processes (dump/undump style, except portably), coroutines and more.

There were a couple of GIL-less variations, but they were either incredibly slow, or suffered serious compatibility problems (and often both).

link

zellyn 2071 days ago

Forgot one: https://github.com/google/grumpy

link

Boxxed 2071 days ago

Whatever happened to psyco? I remember it pretty much just working without any hassle and actually providing a noticeable speedup. All the mindshare is now on PyPy -- it's received enormous amounts of engineering and still seems very rough around the edges.

link

thelazydogsback 2070 days ago

psyco worked well for me at the time as well -- I remember doing something with it and pyGame, FWIR.

link

est 2071 days ago

The HotPy listed by OP was done by Mark Shannon, the same person behind today's proposed 5x speedup.

Also, some relevant old post:

https://news.ycombinator.com/item?id=17107047

link