by riffraff 2141 days ago
Is anyone here using PyPy for their daily job? What do you use it for?
4 comments

I did. Inherited a legacy web app that did stupid things in Python in memory (basically search and aggregation).

I realized a rewrite was the best course of action, but in the meantime the old thing had to stay up and running, and as the volume of data increased, it started to run into HTTP timeouts, as requests often took longer than 2 minutes.

I moved the thing to PyPy, and got about a 30% speedup from that. Only one lib had to be replaced with a pure-Python alternative, as it was using a C extension.

It bought me enough time to finish the new implementation (duplicate the data in Elasticsearch, hey presto from over a minute to about a second to get results).

For some workloads PyPy's JIT can do wonders.

I parse big XML and similarly structured files, convert them into RDF, puff them up into a (still RDF but with a lot of blank nodes) hypergraph so I can load the content into a single database and be able to trace that these two facts are related and come from this part of document A and that part of document B.

I have document parsing and SPARQL queries that can take a few minutes that I'd like to run frequently so I can keep all parts of the system up to date.

I've only benchmarked it a bit, but I found I got approximately the five times speed-up that PyPy promised. This is with PyPy based on Python 3.6. I think PyPy is switching to cffi as the way to connect to C code so most native code "just works" now.
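
A rough way to check a speedup like this on your own code is to run the same script under both interpreters and compare the timings. The following is a minimal sketch, not the benchmark used above: `aggregate` and `best_time` are hypothetical names, and the loop is only a stand-in for whatever your real hot path is.

```python
import platform
import timeit


def aggregate(n):
    """Pure-Python hot loop, a stand-in for the real workload."""
    totals = {}
    for i in range(n):
        key = i % 10
        totals[key] = totals.get(key, 0) + i * i
    return totals


def best_time(n=200000, repeat=3):
    """Return the best wall-clock time in seconds over `repeat` runs."""
    return min(timeit.repeat(lambda: aggregate(n), number=1, repeat=repeat))


if __name__ == "__main__":
    # Prints "CPython" or "PyPy", so the two runs are easy to tell apart.
    print(platform.python_implementation(), platform.python_version())
    print("best of 3: %.4f s" % best_time())
```

Run it once with `python3` and once with `pypy3`. Taking the best of several runs gives the JIT a chance to warm up, which matters for short benchmarks.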

I had to backport my code from Python 3.8. Python 3.6 lacks contextvars, but there is a polyfill for that; otherwise there was no problem.
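
For anyone unfamiliar with contextvars, here is a small sketch of the kind of code that depends on it. It assumes the polyfill exposes the same `contextvars` module name as the standard library does on 3.7+, so the import line stays the same. The variable and function names are made up for illustration.

```python
import contextvars  # stdlib on 3.7+; assumed to come from a polyfill on 3.6

# Hypothetical context variable, for illustration only.
current_document = contextvars.ContextVar("current_document", default=None)


def describe():
    """Read the context-local value without passing it as an argument."""
    return "parsing %s" % current_document.get()


def parse(doc_name):
    """Set the variable for one parse, then restore the previous value."""
    token = current_document.set(doc_name)
    try:
        return describe()
    finally:
        current_document.reset(token)
```

Calling `parse("a.xml")` sees the name inside `describe()`, and the variable is back to its default once the call returns.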

I stayed away from PyPy for a long time because it was tied to Python 3.5, which was busted in various ways. One of those was that the filesystem path objects were half-implemented: you should have been able to pass them into anything from the stdlib that expected a string path, and at that time you couldn't. Little accidents like that can slow down the adoption of a technology like PyPy.
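
To make the path-object complaint concrete, this sketch shows what works from Python 3.6 onward, where `pathlib.Path` objects are accepted by `open()` and the `os.path` functions. The function name `roundtrip` and the file names are hypothetical.

```python
import os
import pathlib
import tempfile


def roundtrip(text="<doc/>"):
    """Write and read a file using a Path object where a str path was once required."""
    with tempfile.TemporaryDirectory() as tmp:
        path = pathlib.Path(tmp) / "doc.xml"
        path.write_text(text)
        # open() accepts path-like objects from 3.6 onward.
        with open(path) as fh:
            content = fh.read()
        # os.path functions accept them too and return plain strings.
        sibling = os.path.join(path.parent, "other.xml")
        return content, os.fspath(path), sibling
```

On 3.5 you would have had to wrap each path in `str()` before passing it to these calls.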

> I think PyPy is switching to cffi as the way to connect to C code so most native code "just works" now.

As far as I know extensions need to be written for cffi specifically.

cffi is a newer way of writing C extensions, developed by the PyPy project. It was designed to have a smaller and cleaner interface for calling C code from Python. Here's Armin Rigo talking about it at EuroPython: https://www.youtube.com/watch?v=ejUzVcvTLgI

The CPython way of writing extensions is documented here: https://docs.python.org/3/extending/extending.html It seems to require you to deal with the internals of the CPython interpreter (PyObject structs, reference counting, etc.).

I know PyPy has some support for CPython extensions, but it has to emulate some internals and it's slower as a result.

Did you use `__slots__` to store the attributes on the object itself, instead of in a per-instance `__dict__` (a hash table)?
I don't remember the details of the legacy app, but I don't recall seeing that. I think it just used dicts for the data and stored them in a blist.sortedlist: https://pypi.org/project/blist/
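
For reference, this is roughly what the `__slots__` question is getting at. It is a sketch with made-up class names, not code from the legacy app.

```python
class DictRecord:
    """Ordinary class: attributes live in a per-instance __dict__."""

    def __init__(self, key, value):
        self.key = key
        self.value = value


class SlotRecord:
    """Slotted class: a fixed set of attributes and no per-instance __dict__."""

    __slots__ = ("key", "value")

    def __init__(self, key, value):
        self.key = key
        self.value = value


def can_add_attribute(obj):
    """Return True if an arbitrary new attribute can be set on obj."""
    try:
        obj.extra = 1
    except AttributeError:
        return False
    return True
```

The slotted version trades flexibility for a smaller per-object footprint on CPython. How much it helps depends on the interpreter and on how many objects you keep in memory.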

blist was the one C dependency, which I replaced with a pure-Python alternative: http://www.grantjenks.com/docs/sortedcontainers/
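
To show why a sorted container does not need a C extension, here is a stdlib-only sketch built on `bisect`. It is not how sortedcontainers or blist are implemented, and `TinySortedList` is a hypothetical name. It only illustrates the kind of pure-Python code that PyPy's JIT can work on directly.

```python
import bisect


class TinySortedList:
    """Hypothetical minimal sorted list built on the stdlib bisect module."""

    def __init__(self, items=()):
        self._items = sorted(items)

    def add(self, value):
        # Insert while keeping the underlying list sorted.
        bisect.insort(self._items, value)

    def __contains__(self, value):
        # Binary search instead of a linear scan.
        i = bisect.bisect_left(self._items, value)
        return i < len(self._items) and self._items[i] == value

    def __iter__(self):
        return iter(self._items)

    def __len__(self):
        return len(self._items)
```

A real library handles large lists far better than this, since each insert into a plain list is linear in its length.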

For algorithmic code PyPy can provide substantial speedups over CPython. I've used PyPy in code fingerprinting large bioinformatics files and seen big speedups. I've also tried porting a webapp processing JSON from CPython and seen no perceptible speedup.
The JSON library is probably a C extension, so PyPy won't make it any faster.
Apparently that isn't always the case. See the PyPy Status Blog: PyPy's new JSON parser https://morepypy.blogspot.com/2019/10/pypys-new-json-parser.... which talks about being more efficient with both deserialization and memory.
It's been a long time since I looked, but when I profiled my code, not much time was spent parsing/serializing JSON. Most of the time was spent manipulating dicts/lists in Python, which CPython is already pretty good at, since the whole language seems to basically be implemented in terms of dicts. I don't think PyPy has the hidden class optimizations of JS engines, which are able to find speedups in these types of cases.
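
A quick way to check where the time goes in a JSON-handling path is the stdlib profiler. This is a minimal sketch with a made-up `handle` function standing in for the webapp code. The record fields `group` and `value` are assumptions for illustration.

```python
import cProfile
import io
import json
import pstats


def handle(payload):
    """Parse JSON, aggregate in plain dicts, and serialize the result."""
    records = json.loads(payload)
    totals = {}
    for rec in records:
        totals[rec["group"]] = totals.get(rec["group"], 0) + rec["value"]
    return json.dumps(totals, sort_keys=True)


def profile_report(payload, top=5):
    """Profile one call to handle() and return the report as a string."""
    profiler = cProfile.Profile()
    profiler.enable()
    handle(payload)
    profiler.disable()
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    stats.sort_stats("cumulative").print_stats(top)
    return out.getvalue()
```

Printing `profile_report(payload)` for a realistic payload shows how the cumulative time splits between the `json` calls and the rest of `handle`.
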
For algorithmic/numerical code, especially if you have to deal with numpy-related data, Numba has a much lower barrier to entry, plus you stay on CPython while speeding up computation-intensive code by a few orders of magnitude.
Looks like Numba has cffi support now, so it would be an option. If I can dig out the code (it was about 5 years ago) I'll probably try adapting it to Numba to benchmark it against PyPy.
I wrote a utility that was used daily (and probably still is) that made good use of PyPy. It was pretty slow, and after some quick profiling I found that type check functions (PyMySQL) were being called A LOT of times. Literally just changing the runtime from python3 to pypy gave something like an 8x overall speedup.
We have a grid compute infrastructure for a specialized runtime environment with business-logic rules for scheduling priorities and partitioning of the compute cluster.

The control plane was implemented in Python and Twisted (an event-driven I/O framework, for the unfamiliar), which was fit for purpose at the original scale running CPython (a few thousand compute nodes).

As the number of compute nodes scaled up, we developed hotspots in serialization/deserialization of control messages, which ultimately started to affect overall cluster efficiency.

Switching to PyPy gave us an immediate substantial performance boost without really having to redo any code at all (just some FFI stuff that was probably wrongly implemented in the first place).

Eventually we realized we were going to out-scale even that (at the level of hundreds of thousands of compute nodes) and ended up with a Scala/Akka reimplementation, but moving to PyPy from CPython got us a lot of free breathing room.