by rciorba 2139 days ago
I did. Inherited a legacy web app that did stupid things in Python in memory (basically search and aggregation).

I realized a rewrite was the best course of action, but in the meantime the old thing had to stay up and running. As the volume of data increased, it started to run into HTTP timeouts, because requests often took longer than 2 minutes.

I moved the thing to PyPy and got about a 30% speedup from that. Only one lib had to be replaced with a pure-Python alternative, as it was using a C extension.

It bought me enough time to finish the new implementation (duplicate the data in Elasticsearch and, hey presto, results take about a second instead of over a minute).

For some workloads PyPy's JIT can do wonders.
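
A rough way to check whether a workload like this benefits is to time the same pure-Python loop under both interpreters. The sketch below is hypothetical: the `aggregate` and `time_it` functions and the synthetic records are made up for illustration, not taken from the legacy app. Run it once with CPython and once with PyPy and compare the numbers.

```python
import platform
import time
from collections import defaultdict


def aggregate(records):
    """Sum the 'value' field per 'key': a stand-in for in-memory aggregation."""
    totals = defaultdict(int)
    for rec in records:
        totals[rec["key"]] += rec["value"]
    return dict(totals)


def time_it(func, *args, repeats=5):
    """Return the best wall-clock time of several runs, in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        best = min(best, time.perf_counter() - start)
    return best


# Synthetic data; the size is arbitrary.
records = [{"key": i % 100, "value": i} for i in range(200_000)]
elapsed = time_it(aggregate, records)
print(f"{platform.python_implementation()}: {elapsed:.3f}s")
```

Taking the best of several runs matters more under PyPy than CPython, since the first iterations include JIT warm-up.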


I parse big XML and similarly structured files, convert them into RDF, and puff them up into a hypergraph (still RDF, but with a lot of blank nodes). That lets me load the content into a single database and trace that two facts are related and that they come from this part of document A and that part of document B.

Some of my document parsing and SPARQL queries can take a few minutes, and I'd like to run them frequently so I can keep all parts of the system up to date.

I've only benchmarked it a bit, but I found I got approximately the five times speed-up that PyPy promised. This is with PyPy based on Python 3.6. I think PyPy is switching to cffi as the way to connect to C code so most native code "just works" now.

I had to backport my code from Python 3.8. Python 3.6 lacks contextvars, but there is a polyfill for that; otherwise there was no problem.

I stayed away from PyPy for a long time because it was tied to Python 3.5, which was busted in various ways. One of those was that the filesystem path objects were half-implemented: you should have been able to pass them into anything from the stdlib that expected a string path, and at that time you couldn't. Little accidents like that can slow down the adoption of a technology like PyPy.
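
To illustrate the path issue: since Python 3.6 (PEP 519), stdlib functions that take a path accept `os.PathLike` objects, so a `pathlib.Path` can be passed directly. A minimal sketch, using a throwaway temporary file whose name is arbitrary:

```python
import os
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "example.txt"
    path.write_text("hello")

    # On 3.6+ these accept the Path object directly.
    with open(path) as fh:
        content = fh.read()
    size = os.path.getsize(path)

    # os.fspath() is the explicit conversion to a plain string path.
    as_string = os.fspath(path)

print(content, size)
```

On 3.5 the usual workaround was to wrap the path in `str(...)` at every call site.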

> I think PyPy is switching to cffi as the way to connect to C code so most native code "just works" now.

As far as I know extensions need to be written for cffi specifically.

cffi is a newer way of writing C extensions, developed by the PyPy project. It was designed to have a smaller and cleaner interface for calling C code from Python. Here's Armin Rigo talking about it at EuroPython: https://www.youtube.com/watch?v=ejUzVcvTLgI

The CPython way of writing extensions is documented here: https://docs.python.org/3/extending/extending.html It seems to require you to deal with the internals of the CPython interpreter (deal with PyObject structs, reference counting, etc).

I know PyPy has some support for CPython extensions, but it has to emulate some internals and it's slower as a result.

Did you use `__slots__` to store data pointers on the object itself, instead of in the per-instance `__dict__` hash table?

Don't remember the details of the legacy app, but I don't recall seeing that. I think it just used dicts for the data and stored that in a blist.sortedlist https://pypi.org/project/blist/
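
For context, a minimal sketch of what the `__slots__` question refers to. The two `Record...` classes are hypothetical, not from the legacy app. Declaring `__slots__` gives each instance fixed attribute storage instead of a per-instance `__dict__`, which usually lowers memory use on CPython. As far as I know the gain is smaller under PyPy, which already optimizes instance layout.

```python
class RecordWithDict:
    """Ordinary class: attributes live in a per-instance __dict__."""

    def __init__(self, key, value):
        self.key = key
        self.value = value


class RecordWithSlots:
    """Slotted class: attributes live in fixed slots on the object."""

    __slots__ = ("key", "value")

    def __init__(self, key, value):
        self.key = key
        self.value = value


plain = RecordWithDict("a", 1)
slotted = RecordWithSlots("a", 1)

print(hasattr(plain, "__dict__"))    # True
print(hasattr(slotted, "__dict__"))  # False

rejected = False
try:
    slotted.extra = 42  # not listed in __slots__
except AttributeError:
    rejected = True
print(rejected)  # True
```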

blist was the one C dependency, which I replaced with a pure-Python alternative: http://www.grantjenks.com/docs/sortedcontainers/
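
As a rough idea of what a sorted list provides, here is a sketch using only the stdlib `bisect` module. It is a stand-in for illustration, not the sortedcontainers or blist API, and the `TinySortedList` class and its method names are hypothetical. Inserts here are O(n) because of list shifting, which is the cost the dedicated libraries are designed to reduce.

```python
import bisect


class TinySortedList:
    """Keeps items in ascending order using binary search on a plain list."""

    def __init__(self, items=()):
        self._items = sorted(items)

    def add(self, item):
        # Binary search for the position, then insert (shifts later items).
        bisect.insort(self._items, item)

    def __contains__(self, item):
        i = bisect.bisect_left(self._items, item)
        return i < len(self._items) and self._items[i] == item

    def irange(self, low, high):
        """Iterate over items with low <= item <= high."""
        start = bisect.bisect_left(self._items, low)
        stop = bisect.bisect_right(self._items, high)
        return iter(self._items[start:stop])

    def __iter__(self):
        return iter(self._items)


scores = TinySortedList([30, 10, 20])
scores.add(15)
print(list(scores))                 # [10, 15, 20, 30]
print(list(scores.irange(12, 25)))  # [15, 20]
```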