by syllogism 1877 days ago

(I'm the creator of spaCy)

If you want to email me at matt@explosion.ai, I'd be interested in the specifics of the algorithm and why the implementation was slow.

For something like that keyword extraction algorithm, the idea would be that if the Python API is slow, you should just use Cython. The Cython API of spaCy is really fast because the `Doc` is just a `TokenC*`, and the tokens just hold a pointer to their lexeme struct, which has the various attributes encoded as integers.
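To make the data layout concrete, here is a rough pure-Python sketch of the idea: strings are interned to integer IDs, each distinct word gets one shared lexeme record, and a document is just a sequence of references to those records. This is not spaCy's code or its Cython API; `ToyStringTable`, `ToyLexeme`, `ToyVocab`, `make_doc` and `count_lower` are hypothetical names made up for illustration, and the whitespace tokenizer is a stand-in.

```python
from dataclasses import dataclass


class ToyStringTable:
    """Hypothetical string interner: maps each distinct string to an int ID."""

    def __init__(self):
        self._ids = {}
        self._strings = []

    def add(self, text):
        # Return the existing ID, or assign the next free one.
        if text not in self._ids:
            self._ids[text] = len(self._strings)
            self._strings.append(text)
        return self._ids[text]

    def __getitem__(self, string_id):
        return self._strings[string_id]


@dataclass(frozen=True)
class ToyLexeme:
    """Per-word-type record; every attribute is an integer or a flag."""
    orth: int       # ID of the exact surface form
    lower: int      # ID of the lowercased form
    length: int
    is_alpha: bool


class ToyVocab:
    """Creates each lexeme once, so tokens of the same type share it."""

    def __init__(self):
        self.strings = ToyStringTable()
        self._lexemes = {}

    def get(self, word):
        orth = self.strings.add(word)
        lex = self._lexemes.get(orth)
        if lex is None:
            lower = self.strings.add(word.lower())
            lex = ToyLexeme(orth, lower, len(word), word.isalpha())
            self._lexemes[orth] = lex
        return lex


def make_doc(vocab, text):
    # Assumption: naive whitespace tokenization, purely for the sketch.
    return [vocab.get(word) for word in text.split()]


def count_lower(doc):
    # The loop only touches integers and flags, never the strings themselves.
    counts = {}
    for lex in doc:
        if lex.is_alpha:
            counts[lex.lower] = counts.get(lex.lower, 0) + 1
    return counts


vocab = ToyVocab()
doc = make_doc(vocab, "The cat saw the other cat")
counts = count_lower(doc)
readable = {vocab.strings[string_id]: n for string_id, n in counts.items()}
```

In pure Python this layout does not buy much speed by itself; the point is that once every attribute is an integer in a struct, the same loop can be written over plain C data in Cython.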

I've never really done a good job of teaching people to use the Cython API though. I completely agree that it's not productive to have slow solutions, and using too many libraries can be a problem. The issue is that Python loops are just too slow: you need to be able to write the loop in C, Cython, etc. Thinking through data structures is also very important.

I get very frustrated that there's this emphasis on parallelism to "solve" speed for Python. Very often the inputs and outputs of the function calls are large enough that you cannot possibly outrace the transfer, pickle and function call overheads, so the more workers you add, the slower it is. Meanwhile if you just write it properly it's 200x faster to start with, and there's no problem.
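A small way to see the overhead in question: `multiprocessing` pickles arguments and results to move them between processes, so the round trip below approximates the minimum cost of handing the data to a worker and getting it back. This is only a sketch under assumptions; `count_tokens` and the synthetic `docs` workload are hypothetical stand-ins, and the timings will vary by machine, so no numbers are claimed here.

```python
import pickle
import time


def count_tokens(docs):
    # Hypothetical cheap per-batch task: total number of tokens.
    return sum(len(doc) for doc in docs)


# Assumption: a synthetic workload of 20,000 documents of 50 tokens each.
docs = [["token"] * 50 for _ in range(20000)]

# Cost of simply doing the work in the current process.
start = time.perf_counter()
direct = count_tokens(docs)
work_seconds = time.perf_counter() - start

# Cost of the serialization round trip a worker process would need,
# before any useful work has been done.
start = time.perf_counter()
payload = pickle.dumps(docs)
received = pickle.loads(payload)
transfer_seconds = time.perf_counter() - start

via_pickle = count_tokens(received)

print(f"work:     {work_seconds:.4f}s")
print(f"transfer: {transfer_seconds:.4f}s for {len(payload)} bytes")
```

If the transfer time is comparable to or larger than the work time, adding workers cannot help for this task, whatever the core count.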

tluyben2 1877 days ago

Sorry if you think I blamed spaCy for anything; that was not intended. I know it was due to the way Python was used, which I tried to convey. Your product is excellent and yes, I probably should've reached out more anyway; I just know how to solve things my way and did not want to waste more time (there was an investor deadline).

Cython part sounds good; I will try it out and email you if I get totally stuck, thanks!

syllogism 1877 days ago

Oh, I didn't take it as a pointed criticism or anything. It just seemed like it would be an instructive example.

The underlying point I often make to people is that Python's slowness introduces a lot of incidental complexity, and you find yourself fiddling with numpy or something instead of just writing normal code and expecting it to perform normally.