by pqwEfkvjs 3145 days ago
Kudos to Matthew, Ines and others for making this possible.

I haven't checked it out myself yet, so I wanted to ask: have the performance issues that were haunting the 2.0 alpha version been fixed?

2 comments

Current discussion: https://github.com/explosion/spaCy/issues/1508

I'm getting around 8k words per second on the smallest Google Cloud instances. You couldn't run spaCy 1 on these instances (or on AWS Lambda) because of memory usage problems, especially the difficulty of predicting memory usage for long-running processes. This is why we say spaCy 2 is cheaper to run in a cents-per-word sense than spaCy 1. This is the performance measure that we think is most important.
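To make the cents-per-word framing concrete, here is a rough back-of-the-envelope sketch. The 8k words per second figure is from the comment above; the hourly price is a made-up placeholder, not a quoted rate for any cloud provider, and the function name is only illustrative.

```python
def cents_per_million_words(words_per_second: float, dollars_per_hour: float) -> float:
    """Convert throughput and an hourly instance price into cents per million words."""
    words_per_hour = words_per_second * 3600
    dollars_per_word = dollars_per_hour / words_per_hour
    return dollars_per_word * 100 * 1_000_000


# Hypothetical price for a small instance; substitute your provider's real rate.
HYPOTHETICAL_DOLLARS_PER_HOUR = 0.01

cost = cents_per_million_words(8000, HYPOTHETICAL_DOLLARS_PER_HOUR)
print(f"{cost:.4f} cents per million words")
```

The point of the measure is that a cheaper instance can win even at lower raw speed: halving the hourly price helps exactly as much as doubling the words per second.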

However, users are still reporting performance problems, so I wouldn't call the issue resolved. spaCy 1 managed to avoid depending on numpy during prediction, making it easy to ensure that performance didn't depend on anyone's environment. spaCy 2 currently does use numpy, introducing these questions around configuration. I'm working to fix this by implementing the forward pass entirely in Cython.

Found the answer myself in the release docs:

> The Language.pipe method allows spaCy to batch documents, which brings a significant performance advantage in v2.0. The new neural networks introduce some overhead per batch, so if you're processing a number of documents in a row, you should use nlp.pipe and process the texts as a stream.
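A minimal sketch of what streaming through nlp.pipe looks like in practice. The helper name `count_tokens` and the batch size of 1000 are illustrative choices, not values taken from the docs; the helper accepts any object that exposes a spaCy-style `pipe` method.

```python
def count_tokens(nlp, texts, batch_size=1000):
    """Stream texts through nlp.pipe and yield (text, token_count) pairs.

    nlp.pipe batches the texts internally, so the per-batch overhead is
    shared across many documents instead of being paid once per call.
    """
    for doc in nlp.pipe(texts, batch_size=batch_size):
        yield doc.text, len(doc)


def demo():
    """Usage sketch; assumes spaCy is installed (imported lazily here)."""
    import spacy

    # A blank English pipeline only tokenizes. To exercise the neural models,
    # load a trained pipeline instead, e.g. spacy.load("en_core_web_sm"),
    # assuming that model package is installed.
    nlp = spacy.blank("en")
    texts = ["This is one document.", "Here is another one."]
    for text, n_tokens in count_tokens(nlp, texts):
        print(n_tokens, text)
```

Because `texts` can be any iterable, including a generator, the whole corpus never has to sit in memory at once.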

So if you have an event-based system where you can only process a single document at a time, it does not make sense to upgrade yet: in the single-document case, runtime performance was 10x-100x slower, at least with the 2.0 alpha version.

But with a nice caveat: In an event-based system, you can run spaCy 2 with AWS Lambda :). This will be much cheaper than keeping a server warm.
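If you want to check how large the gap between single-document calls and nlp.pipe is on your own setup, a small timing harness along these lines may help. This is only a sketch: `compare_throughput` is a hypothetical helper, it assumes `nlp` is callable on a single text and has a `pipe` method (as a spaCy `Language` object does), and the numbers will depend on your models, hardware, and environment.

```python
import time


def compare_throughput(nlp, texts, batch_size=1000):
    """Return (seconds_one_by_one, seconds_piped) for the same list of texts."""
    texts = list(texts)

    start = time.perf_counter()
    for text in texts:
        nlp(text)  # one call per document, as an event-driven handler would do
    one_by_one = time.perf_counter() - start

    start = time.perf_counter()
    for _ in nlp.pipe(texts, batch_size=batch_size):
        pass  # consume the generator so the batched work actually runs
    piped = time.perf_counter() - start

    return one_by_one, piped
```

Run it on a sample of your real documents rather than toy sentences, since document length affects both paths.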