by weinzierl 3692 days ago
spaCy is another active open source (MIT) POS-tagger. In a previous discussion on HN[1] it was well received.

There is a simplified, educational 200-line Python version [2] of it. It claims 96.8% accuracy on the WSJ corpus.

What am I missing here?

[1] https://news.ycombinator.com/item?id=8942783

[2] https://spacy.io/blog/part-of-speech-pos-tagger-in-python


Those are the part-of-speech tag accuracies. spaCy's accuracy on the PTB evaluation is 92.2% --- so it makes 20% more errors than P. McP. On the other hand, spaCy is about 200x faster.

I've been watching the line of research in SyntaxNet closely, and have been steadily working on replacing spaCy's averaged perceptron model with a neural network model. This is one of the main differences between spaCy and Parser McParseface.
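
For readers who haven't met the term, here is a minimal sketch of an averaged perceptron. It is a toy written for illustration, not spaCy's actual code: the class name, the feature strings and the tiny training set are all made up, and ties are broken by class name purely to keep the output deterministic.

```python
from collections import defaultdict


class AveragedPerceptron:
    """Multiclass perceptron whose final weights are averaged over all steps."""

    def __init__(self, classes):
        self.classes = sorted(classes)
        self.weights = defaultdict(float)  # (feature, class) -> current weight
        self._totals = defaultdict(float)  # (feature, class) -> weight summed over steps
        self._stamps = defaultdict(int)    # (feature, class) -> step of last change
        self.step = 0

    def predict(self, features):
        scores = {c: 0.0 for c in self.classes}
        for f in features:
            for c in self.classes:
                scores[c] += self.weights.get((f, c), 0.0)
        # Ties are broken by class name so the result is deterministic.
        return max(self.classes, key=lambda c: (scores[c], c))

    def _bump(self, key, delta):
        # Credit the old weight for every step it was in force, then change it.
        self._totals[key] += (self.step - self._stamps[key]) * self.weights[key]
        self._stamps[key] = self.step
        self.weights[key] += delta

    def update(self, features, truth):
        self.step += 1
        guess = self.predict(features)
        if guess != truth:
            for f in features:
                self._bump((f, truth), +1.0)
                self._bump((f, guess), -1.0)

    def average(self):
        # Replace each weight by its mean value over all training steps.
        if self.step == 0:
            return
        for key, w in self.weights.items():
            total = self._totals[key] + (self.step - self._stamps[key]) * w
            self.weights[key] = total / self.step


# Hypothetical toy data: (features, tag) pairs.
train = [
    (["word=the", "suffix=he"], "DET"),
    (["word=dog", "prev=DET"], "NOUN"),
    (["word=barks", "suffix=ks", "prev=NOUN"], "VERB"),
    (["word=a"], "DET"),
    (["word=cat", "prev=DET"], "NOUN"),
    (["word=sleeps", "suffix=ps", "prev=NOUN"], "VERB"),
]
model = AveragedPerceptron({"DET", "NOUN", "VERB"})
for _ in range(5):
    for feats, tag in train:
        model.update(feats, tag)
model.average()

print(model.predict(["word=bird", "prev=DET"]))  # NOUN
```

Averaging makes the final weights less sensitive to whichever examples happened to come last in training. Note that every weight is keyed by a (feature, class) pair, so the table grows with the number of distinct features seen.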

The key advantage of the neural network is that it lets you take advantage of training on lots and lots more text, in a semi-supervised way. In a linear model, you grow extra parameters when you do this. The neural network stays the same size --- it just gets better. So, you can benefit from reading the whole web into the neural network. This only works a little bit in the linear model, and it makes the resulting model enormous.
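
A rough way to see the size argument, as a sketch under my own assumptions: the linear model below keeps one weight per (distinct feature, class) pair, with word and bigram features chosen only for illustration, while the "network" is just a parameter count for a fixed-shape architecture (a fixed-size embedding table plus two dense layers). Neither function describes spaCy's or SyntaxNet's real feature set or layer sizes.

```python
def linear_param_count(sentences, n_classes):
    """Sparse linear model: one weight per (distinct feature, class) pair."""
    features = set()
    for words in sentences:
        for prev, word in zip(["<s>"] + words, words):
            features.add("word=" + word)
            features.add("bigram=" + prev + "_" + word)
    return len(features) * n_classes


def dense_param_count(table_rows, embed_dim, hidden, n_classes):
    """Fixed-shape network: embedding table plus two dense layers with biases."""
    embedding = table_rows * embed_dim
    hidden_layer = embed_dim * hidden + hidden
    output_layer = hidden * n_classes + n_classes
    return embedding + hidden_layer + output_layer


small = [["the", "dog", "barks"]]
large = small + [["a", "cat", "sleeps"], ["the", "cat", "barks"]]

print(linear_param_count(small, 3))      # 18
print(linear_param_count(large, 3))      # 42
print(dense_param_count(1000, 8, 16, 3)) # 8195, whatever text is read
```

The linear count keeps climbing as new words and bigrams appear, whereas the second number depends only on the chosen shapes.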

Another difference is that spaCy is trained on whole documents, while P. McP. is trained in the standard set-up, using gold pre-processing. I speculate this will reduce the gap between the systems in a more realistic evaluation. Of course, P. McP. can do the joint training too if they choose to. I've reached out to see whether they're interested in running the experiment: https://github.com/tensorflow/models/issues/65

Thanks for spaCy! Is that 200x with both using GPU?
spaCy doesn't use the GPU. Not sure what their speed is on GPU. I wouldn't be surprised if it's hard to use the GPU well for their parser, because minibatching gets complicated. Not sure.
Also have been using spaCy with good results.

Just installed syntaxnet - tests passed in the following setup.

https://gist.github.com/Hendler/61831e411069815ee4ed490f553f...

INFO: Elapsed time: 908.048s, Critical Path: 640.26s

//syntaxnet:arc_standard_transitions_test PASSED in 0.0s

//syntaxnet:beam_reader_ops_test PASSED in 20.9s

//syntaxnet:graph_builder_test PASSED in 16.3s

//syntaxnet:lexicon_builder_test PASSED in 1.8s

//syntaxnet:parser_features_test PASSED in 0.0s

//syntaxnet:parser_trainer_test PASSED in 46.1s

//syntaxnet:reader_ops_test PASSED in 5.7s

//syntaxnet:sentence_features_test PASSED in 0.0s

//syntaxnet:shared_store_test PASSED in 0.5s

//syntaxnet:tagger_transitions_test PASSED in 0.0s

//syntaxnet:text_formats_test PASSED in 1.7s

//util/utf8:unicodetext_unittest PASSED in 0.0s

Some other notes:

I'm also using Keras with Theano. Before spaCy, I used StanfordNLP, Freeling, and/or NLTK.

spaCy's 96.8% accuracy is for the task of POS tagging while Google's reported 94% accuracy is for dependency parsing, a significantly harder problem.
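
To make the distinction concrete, here is a small sketch of how the two kinds of score are usually computed. The thread doesn't say which parsing metric is behind the 94% figure, so the unlabeled and labeled attachment scores (UAS and LAS) below are the conventional ones, assumed here for illustration; the sentence, tags and arcs are invented.

```python
def tag_accuracy(gold_tags, pred_tags):
    """Fraction of tokens whose predicted POS tag matches the gold tag."""
    correct = sum(g == p for g, p in zip(gold_tags, pred_tags))
    return correct / len(gold_tags)


def attachment_scores(gold_arcs, pred_arcs):
    """Return (UAS, LAS) for arcs given as one (head_index, label) per token."""
    n = len(gold_arcs)
    uas = sum(g[0] == p[0] for g, p in zip(gold_arcs, pred_arcs)) / n
    las = sum(g == p for g, p in zip(gold_arcs, pred_arcs)) / n
    return uas, las


# "The dog chased the cat": tokens are numbered from 1, head 0 is the root.
gold_tags = ["DT", "NN", "VBD", "DT", "NN"]
pred_tags = ["DT", "NN", "VBD", "DT", "NN"]

gold_arcs = [(2, "det"), (3, "nsubj"), (0, "root"), (5, "det"), (3, "dobj")]
pred_arcs = [(2, "det"), (3, "nsubj"), (0, "root"), (3, "det"), (3, "nsubj")]

print(tag_accuracy(gold_tags, pred_tags))       # 1.0
print(attachment_scores(gold_arcs, pred_arcs))  # (0.8, 0.6)
```

Every tag is right here, yet one head and two labels are wrong. A tagger picks one label per token, while a parser also has to pick each token's head from the rest of the sentence, which is why the two numbers aren't directly comparable.
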
Thanks (also to charlieegan3), this explains the difference nicely.

Actually, I found an educational 500-line Python parser from the same author as well [1]. It claims an accuracy of 92.7 for the WSJ corpus.

[1] https://spacy.io/blog/parsing-english-in-python

Well, it seems this page could use some work: http://www.aclweb.org/aclwiki/index.php?title=Dependency_Par...
spaCy also has a dependency parser - looks like this blog post is just talking about the POS-tagger.