
I gave some thoughts earlier in the thread about how we see spaCy fitting into the NLP ecosystem. Here’s another answer that’s maybe more direct, through an example.

Let’s say you want to build a system that uses an efficient bag-of-words text classifier to select paragraphs that might have information of interest, then runs an entity recognizer and extracts relation triples between predicates and pairs of entities. When extracting the triples, you want to use the lemma of the relation word, so you want to map “dove” to “dive” when it’s a verb, and so on. It’s possible to build this system directly using PyTorch modules for the various model parts, but you’ll need to write the logic that strings the model predictions together yourself. And for tasks like lemmatization that are pretty easy, you’ll struggle to find existing systems that you’ll actually want to use. A lemmatization system that’s published for PyTorch will probably be designed for languages where lemmatization is really hard, which makes it overkill for English, where it’s really easy.
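To make the "string it together yourself" part concrete, here is a rough sketch of that glue logic in plain Python. It is not spaCy's API and there are no real models in it: the trigger-word set, the entity gazetteer and the lemma table are hypothetical stand-ins for the classifier, the entity recognizer and the lemmatizer, and the function names are made up for illustration.

```python
import re

# Toy stand-ins for trained components; every name and table below is
# hypothetical and exists only to illustrate the glue logic.
TRIGGER_WORDS = {"acquired", "dove"}                   # bag-of-words "classifier" vocabulary
KNOWN_ENTITIES = {"Acme", "Globex", "Alice", "lake"}   # gazetteer standing in for NER
VERB_LEMMAS = {"acquired": "acquire", "dove": "dive"}  # lookup lemmatizer (verbs only)


def tokenize(text):
    """Split text into word tokens, dropping punctuation."""
    return re.findall(r"\w+", text)


def is_relevant(tokens):
    """Stage 1: keep a paragraph if it contains any trigger word."""
    return any(tok.lower() in TRIGGER_WORDS for tok in tokens)


def find_entities(tokens):
    """Stage 2: return the positions of tokens that match the gazetteer."""
    return [i for i, tok in enumerate(tokens) if tok in KNOWN_ENTITIES]


def extract_triples(paragraph):
    """Stages 3 and 4: pair neighbouring entities and lemmatize the relation word."""
    tokens = tokenize(paragraph)
    if not is_relevant(tokens):
        return []
    triples = []
    positions = find_entities(tokens)
    for left, right in zip(positions, positions[1:]):
        # Use the first known verb between the two entities as the relation.
        # A real lemmatizer would also check the part of speech here, since
        # "dove" should only map to "dive" when it is a verb.
        for tok in tokens[left + 1:right]:
            lemma = VERB_LEMMAS.get(tok.lower())
            if lemma is not None:
                triples.append((tokens[left], lemma, tokens[right]))
                break
    return triples
```

Calling `extract_triples("Alice dove into the lake.")` yields a single triple with the lemma `"dive"` as the relation, while a paragraph with no trigger word is dropped before the later stages run. Even in this toy form, most of the code is bookkeeping between stages rather than modelling.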

spaCy has a good architecture and API for this system-level stuff, where you’re putting together models into practical solutions. It also has a Doc object that makes it really easy to actually work with the system outputs, especially to relate multiple levels of annotation to each other.

Partly because orchestrating a number of models is kind of a hassle in lower-level frameworks, a lot of guides will encourage you to take entirely joint approaches to this type of system. In theory you can bypass problems like the lemmatization and NER if you take a sequence-to-sequence approach, and generate the relation triples as just arbitrary data. But this has a lot of limitations. It’s difficult to express structural constraints that you know should hold about the triples, the system will be much, much slower, and it might require vastly more training data. It’s also difficult to divide the task up between different people, and it’ll be difficult to analyse the system errors, iterate on individual parts, or inject rule logic to ensure certain invariants about the output. All of these properties of joint approaches increase the risk of the project failing; they add large uncertainties that keep projects from getting out of the prototype stage.
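As an illustration of what "inject rule logic" can look like when the triples come out of a staged pipeline, here is a minimal sketch in plain Python. The relation schema, the entity type labels and the function names are all hypothetical; the point is only that a check like this is a few lines when each triple is structured data with typed arguments, and much harder to bolt on when the triples are free-form generated text.

```python
# Hypothetical schema: the (subject type, object type) pairs each relation allows.
ALLOWED_ARGUMENTS = {
    "acquire": {("ORG", "ORG")},
    "employ": {("ORG", "PERSON")},
}


def is_valid(triple):
    """Check one ((text, type), relation, (text, type)) triple against the schema."""
    (subj, subj_type), relation, (obj, obj_type) = triple
    if subj == obj:
        # Invariant: an entity cannot stand in a relation with itself.
        return False
    # Invariant: the relation must be known and its argument types must match.
    return (subj_type, obj_type) in ALLOWED_ARGUMENTS.get(relation, set())


def enforce_invariants(triples):
    """Drop every candidate triple that violates the schema."""
    return [t for t in triples if is_valid(t)]
```

Because the filter is a separate step, it can be tested, tightened or handed to someone else without retraining anything upstream.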