| > and is fast because the speedy parts aren't in Python. Having worked for months with a slew of senior data scientists, this was a bit painful. Python is so slow. Those data scientists were very good at coming up with solutions to the company's problems, but the implementations (using spaCy, Pandas and other libs) had enough Python in them to make them impractical for the company's use case. They were nice prototypes, which I then had to fix or even rewrite in C/C++ (we tried Rust as well) to make them usable in the company data pipeline. I think companies are burning millions (billions in total?) on depressingly slow solutions in this space, throwing massive compute at them so the computations complete before the sun dies out. Example: we needed a specific keyword extraction algorithm for multiple languages; my colleague used spaCy and Python to create it. It took a couple of seconds per page of text; we needed a few ms at most on modern hardware. He spent quite a lot of time rewriting and changing it, but never got it under 1s per page on xlarge AWS instances. My version takes a few ms on average, executing the same algorithm but in optimised C/C++. Sure, we could've spun up a lot more instances, but my rewrite was far cheaper than that, even in the first month. |
If you want to email me at matt@explosion.ai, I'd be interested in the specifics of the algorithm and why the implementation was slow.
For something like that keyword extraction algorithm, the idea is that if the Python API is too slow, you should just use Cython. spaCy's Cython API is really fast because the `Doc` is just a `TokenC*`, and each token just holds a pointer to its lexeme struct, which has the various attributes encoded as integers.
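To make that layout a bit more concrete, here's a rough pure-Python sketch of the same idea: each distinct word is stored once, the document is just an array of integer ids, and per-word attributes live in parallel integer arrays. This is not spaCy's code. `Vocab`, `make_doc` and `count_lower` are hypothetical names, and sequential ids are an assumption made for simplicity. The point is only that once everything is an integer, the hot loop is the sort of thing that translates directly into a C or Cython loop over structs.

```python
from array import array


class Vocab:
    """Hypothetical vocabulary: each distinct word is stored exactly once."""

    def __init__(self):
        self._ids = {}               # word -> lexeme id
        self.strings = []            # lexeme id -> original string
        self.lower = array("q")      # lexeme id -> id of the lowercased form
        self.is_alpha = array("b")   # lexeme id -> 1 if purely alphabetic

    def add(self, word):
        """Return the integer id for `word`, creating its entry if needed."""
        lex_id = self._ids.get(word)
        if lex_id is None:
            lex_id = len(self.strings)
            self._ids[word] = lex_id
            self.strings.append(word)
            self.lower.append(lex_id)  # default: the word is its own lowercase
            self.is_alpha.append(1 if word.isalpha() else 0)
            low = word.lower()
            if low != word:
                self.lower[lex_id] = self.add(low)
        return lex_id


def make_doc(vocab, text):
    """A 'doc' is just an array of lexeme ids (whitespace tokenization only)."""
    return array("q", (vocab.add(w) for w in text.split()))


def count_lower(vocab, doc):
    """Count alphabetic tokens by lowercase form, touching only integers."""
    counts = {}
    for lex_id in doc:
        if vocab.is_alpha[lex_id]:
            key = vocab.lower[lex_id]
            counts[key] = counts.get(key, 0) + 1
    return counts


vocab = Vocab()
doc = make_doc(vocab, "The cat saw the other cat .")
# Strings are only looked up at the very end, for display.
print({vocab.strings[k]: n for k, n in count_lower(vocab, doc).items()})
```

In pure Python the loop itself is still slow, of course. The sketch is about the data structure, not the speed.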
I've never really done a good job of teaching people to use the Cython API, though. I completely agree that slow solutions aren't productive, and using too many libraries can be a problem. The issue is that Python loops are just too slow: you need to be able to write the loop in C/Cython/etc. Thinking through data structures is also very important.
I get very frustrated that there's this emphasis on parallelism to "solve" speed for Python. Very often the inputs and outputs of the function calls are large enough that you cannot possibly outrace the transfer, pickle and function call overheads, so the more workers you add, the slower it is. Meanwhile if you just write it properly it's 200x faster to start with, and there's no problem.
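If you want to check this on your own workload, here's a minimal sketch that times the actual work against the pickle round trip a worker process would need for the same batch. `count_tokens` and `measure` are hypothetical stand-ins, and this only covers serialization, not the inter-process transfer itself, so if anything it understates the overhead. The numbers depend entirely on your data and machine.

```python
import pickle
import time


def count_tokens(pages):
    """Stand-in for cheap per-item work: count whitespace-separated tokens."""
    return [len(page.split()) for page in pages]


def measure(pages):
    """Return (compute_seconds, serialization_seconds) for one batch."""
    start = time.perf_counter()
    result = count_tokens(pages)
    compute = time.perf_counter() - start

    # A worker process needs the input pickled and unpickled on the way in,
    # and the result pickled and unpickled on the way back.
    start = time.perf_counter()
    pages_copy = pickle.loads(pickle.dumps(pages))
    result_copy = pickle.loads(pickle.dumps(result))
    serialization = time.perf_counter() - start

    # Sanity check that the round trip preserved the data.
    assert pages_copy == pages and result_copy == result
    return compute, serialization


pages = ["some text on a page " * 200 for _ in range(2000)]
compute, serialization = measure(pages)
print(f"compute: {compute:.4f}s, pickle round trip: {serialization:.4f}s")
```

If the second number is in the same ballpark as the first, adding workers mostly adds overhead.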