by tluyben2 1877 days ago
> and is fast because the speedy parts aren't in Python.

Having worked for months with a slew of senior data scientists, this was a bit painful. Python is so slow, and while those data scientists were very good at coming up with solutions for the company's issues, the implementations (using spaCy, Pandas and other libs) had enough Python in them to make them impractical for the company use case. Nice prototypes, which I then had to fix or even rewrite in C/C++ (we tried Rust as well) to make them usable in the company data pipeline.

I think companies are burning millions (billions in total?) on depressingly slow solutions in this space by throwing massive power at it all to make them complete their computations before the sun dies out.

Example: we needed a specific keyword extraction algorithm for multiple languages; my colleague used spaCy and Python to create it. It took a couple of seconds per page of text; we needed at most a few ms on modern hardware. He spent quite a lot of time rewriting and changing it, but never got it under 1s per page on xlarge AWS instances. My version takes a few ms on average, executing the same algorithm but in optimised C/C++.

Sure we could've spun up a lot more instances, but my rewrite was far cheaper than that, even in the first month.

7 comments

(I'm the creator of spaCy)

If you want to email me at matt@explosion.ai , I'd be interested in the specifics of the algorithm and why the implementation was slow.

The idea for something like that keyword extraction algorithm would be that if the Python API is slow, you should just use Cython. The Cython API of spaCy is really fast because the `Doc` is just a `TokenC*`, and the tokens just hold a pointer to their lexeme struct, which has the various attributes encoded as integers.

I've never really done a good job of teaching people to use the Cython API though. I completely agree that it's not productive to have slow solutions, and using too many libraries can be a problem. The issue is that Python loops are just too slow: you need to be able to write a loop in C/Cython/etc. Thinking through data structures is also very important.

I get very frustrated that there's this emphasis on parallelism to "solve" speed for Python. Very often the inputs and outputs of the function calls are large enough that you cannot possibly outrace the transfer, pickle and function call overheads, so the more workers you add, the slower it is. Meanwhile if you just write it properly it's 200x faster to start with, and there's no problem.
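
To make the overhead point concrete, here is a rough sketch (an illustration added for this thread, not code from anyone's pipeline). It times one pickle round trip of a large input, which is roughly what shipping arguments to a worker process involves, and compares that with simply doing a cheap computation in-process. The function names `cheap_work`, `transfer_cost` and `compute_cost` and the data size are made up for the example, and the numbers will vary by machine.

```python
import pickle
import time


def cheap_work(values):
    """Hypothetical stand-in for a function whose compute is small relative to its input."""
    return sum(values)


def transfer_cost(values):
    """Time one pickle round trip, roughly the cost of shipping `values` to a worker."""
    start = time.perf_counter()
    payload = pickle.dumps(values, protocol=pickle.HIGHEST_PROTOCOL)
    restored = pickle.loads(payload)
    elapsed = time.perf_counter() - start
    return elapsed, len(payload), restored


def compute_cost(values):
    """Time the actual work when it runs in-process, with no serialisation."""
    start = time.perf_counter()
    result = cheap_work(values)
    return time.perf_counter() - start, result


if __name__ == "__main__":
    data = list(range(1_000_000))  # arbitrary size, chosen only for illustration
    t_transfer, n_bytes, _ = transfer_cost(data)
    t_compute, _ = compute_cost(data)
    print(f"pickle round trip:  {t_transfer:.3f}s for {n_bytes / 1e6:.1f} MB")
    print(f"in-process compute: {t_compute:.3f}s")
```

If the round trip costs about as much as the work itself, adding workers mostly adds serialisation time, which is the situation described above.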

Sorry if you think I blamed spaCy for anything; it was not intended. I know it was due to the way Python was used, which I tried to convey. Your product is excellent and yes, I probably should've reached out more anyway; I just know how to solve things my way and did not want to waste more time (there was an investor deadline).

Cython part sounds good; I will try it out and email you if I get totally stuck, thanks!

Oh, I didn't take it as a pointed criticism or anything. It just seemed like it would be an instructive example.

The underlying point I often make to people is that Python's slowness introduces a lot of incidental complexity, and you find yourself fiddling with numpy or something instead of just writing normal code and expecting it to perform normally.

But that's fine, no? I mean, it's a pretty common workflow where the people close to the science part of something write a prototype in their language/ecosystem of choice, and then the engineering side is in charge of taking the prototype implementation and making it performant enough for production use. Finding people who know both data science and low-level programming languages well enough to implement data science applications directly for production is pretty hard, I'm sure.

In either case, I much prefer prototypes in Python to, say, Matlab. To speed things up I once rewrote an internal SciPy function into a version that allowed me to use it in vectorized code on my end. If the prototype is in Matlab, the optimization and integration possibilities are much more limited due to licensing, toolboxes, and the closed ecosystem in general.

Also I think it is good to be able to use the python code as testcases/validation on smaller datasets for the C code.
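
A minimal sketch of that validation idea, under some assumptions: `word_counts_reference` plays the role of the slow prototype, and `word_counts_optimised` is only a stand-in for the production port (in practice it would be a binding to the C code). All names and the toy vocabulary are hypothetical.

```python
import random
from collections import Counter


def word_counts_reference(text):
    """Slow, obviously-correct prototype (the 'science' version)."""
    counts = {}
    for word in text.lower().split():
        counts[word] = counts.get(word, 0) + 1
    return counts


def word_counts_optimised(text):
    """Stand-in for the production port; really this would call into the C code."""
    return dict(Counter(text.lower().split()))


def check_against_reference(candidate, reference, n_cases=200, seed=0):
    """Run both versions on small random inputs; return the first mismatching input, or None."""
    rng = random.Random(seed)
    vocab = ["data", "Python", "fast", "slow", "C", "loop", "spaCy"]
    for _ in range(n_cases):
        text = " ".join(rng.choice(vocab) for _ in range(rng.randint(0, 30)))
        if candidate(text) != reference(text):
            return text
    return None


if __name__ == "__main__":
    mismatch = check_against_reference(word_counts_optimised, word_counts_reference)
    print("all cases agree" if mismatch is None else f"mismatch on: {mismatch!r}")
```

The fixed seed keeps a failing case reproducible, so a mismatch in the port can be replayed against the prototype.
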
Yes, I guess it is fine if that is the flow. I just didn't expect it upfront (my bad).
Yeah, if it's actually OK or not depends a lot on the particulars. Like, if it's not actually your job, and the data people were supposed to produce production ready stuff themselves, and then you have to go out of your way to actually make it work, then it's not OK. But that's more to do with how organizations function, not technical merits of the involved programming languages.
Python performance is something I see come up about 15-20x more often in discussions about Python than when actually wrestling with real-life problems.
Agreed. The only time in practice (working on datasets consisting of millions of rows) that Python has been too slow was when I was taking courses in college and their online code thing timed out on some specific graph problems. I rewrote the algo in C# without any fuss.
All analysis we run is over 100s of millions of objects (objects can be a few megabytes large) every time it runs. The slightest increase or decrease for 1 object obviously makes a huge difference overall; either in cost, time or both.
Absolutely, but I don't feel the vast majority of corporations are doing that type of computation. I feel that even with hundreds of millions of rows Python can be a great solution (I have done multiple projects generating fairly complex projections from a few hundred million rows) for most projects.
> Nice prototypes, which I then had to fix or even rewrite in C/C++

Even as someone who 'knows' C and C++ I still find it faster and easier overall to do the exploratory and 'science' part in Python, making sure it works and gives me the answers I need in the format I want etc. only to then rewrite and optimize the slow parts in C or C++ if necessary.

You could probably use something like Chez Scheme for that, especially the new Racket fork of it with unboxed flonums and flonum vectors, except with significantly decreased need to rewrite stuff in C for performance. Also with proper threads to boot.
You're missing the point. What makes Python 'fast' is that it comes with 'out of the box' support for everything I might need. Reading and writing obscure file formats. Every sparse matrix, image processing and graph algorithm I might need has already been implemented. Do I need to all of a sudden solve an optimization problem or a differential equation? Already nicely integrated with the library I'm using. If I need some obscure domain-specific algorithm, there is almost certainly a library for that already. I don't have to worry about any of that stuff and can focus on solving my problem.
There's quite a few C/C++ libraries for those things, right? Integrating them with proper FFI (of the kind that Chez has, for example - or LuaJIT, for that matter) seems hardly difficult...although differential equations or optimization problems may be better served without any language interfaces whatsoever since they involve functional parameterization. That generally sucks with mixed-language solutions, to the extent that GSLShell, the LuaJIT interface to GSL, completely skipped the DE solvers in GSL and reimplemented them in Lua for higher performance. I imagine that 'fast' in case of Python really needs to be quoted the way you did.
"seems hardly difficult" and "could probably use" is seldom faster than "out of the box" and "already nicely integrated"
Maybe, but the latter definitely isn't the case with Python, especially in case of PyPy (or at least it wasn't the case the last time I tried that), so there's that.
Have you considered how much time the "senior data scientists" saved, and how much better an algorithmic solution they were able to develop by being able to iteratively explore the problem space and refine the algorithm in a convenient-to-them environment?

This development process allowed creation of a good solution that you were then able to quickly port to a performant production platform.

I've worked in this industry for a bit and had the opposite experience. Most of the slow solutions have been in other languages, due to poor design and algo choice. There have been several projects I've been able to rewrite from R to Python where the runtime went from days with R on a workstation to seconds or minutes with Python in a tiny, tiny VM (like a $10 DO box).

Sure maybe a 3 minute task in python that reconciles a few million transactions and builds some very useful projections is too slow for some pipelines, but it worked for my clients.

So the R code was pretty bad and your solution was more optimal - that's the only bit of information in here. The point is that rewriting something means you have extra domain knowledge, bottleneck knowledge, etc. that you didn't have during the initial write. It might have gone the other way too: initial write in Python is too slow, rewrite in R is faster.
Oh yeah, it definitely was - my point wasn't that R is slower than Python. My point is the only time I've encountered "slow apps" in practice was when they were written in a much faster language, and rewriting them in a slower language resulted in really good performance.
The vast majority of the time that I've seen slow Python code, it's because people are not leveraging the libraries in the right way, e.g. by using groupby-apply in Pandas or not vectorising while using NumPy.
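
As a rough illustration of the vectorising point (a sketch with made-up function names, assuming group keys are small non-negative integers), here is a per-group mean written two ways in NumPy: once with a Python-level loop over every element, and once with `np.bincount`, so the per-element work happens inside NumPy.

```python
import numpy as np


def group_means_loop(keys, values):
    """Per-group mean using a Python loop over every element (the slow pattern)."""
    totals, counts = {}, {}
    for k, v in zip(keys, values):
        totals[k] = totals.get(k, 0.0) + v
        counts[k] = counts.get(k, 0) + 1
    return {k: totals[k] / counts[k] for k in totals}


def group_means_vectorised(keys, values, n_groups):
    """Same computation, with the per-element work done inside NumPy."""
    sums = np.bincount(keys, weights=values, minlength=n_groups)
    counts = np.bincount(keys, minlength=n_groups)
    # Groups with no rows divide 0 by 0; silence the warning and leave NaN there.
    with np.errstate(invalid="ignore", divide="ignore"):
        return sums / counts


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    keys = rng.integers(0, 100, size=200_000)
    values = rng.random(200_000)
    loop_result = group_means_loop(keys, values)
    vec_result = group_means_vectorised(keys, values, 100)
    agree = all(np.isclose(loop_result[k], vec_result[k]) for k in loop_result)
    print("results agree:", agree)
```

Timing the two with `timeit` on your own data is the honest way to see the gap; how large it is depends on the array size and the number of groups.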

I can't speak as to the specific use cases that you've encountered, but performance wise, I have found Python to be a fine choice for several ML services.