HN Mirror

by weinzierl 3692 days ago

spaCy is another active open source (MIT) POS-tagger. In a previous discussion on HN[1] it was well received.

There is a simplified, educational 200-line Python version [2] of it. It claims 96.8% accuracy on the WSJ corpus.

What am I missing here?

[1] https://news.ycombinator.com/item?id=8942783

[2] https://spacy.io/blog/part-of-speech-pos-tagger-in-python

4 comments

syllogism 3692 days ago

Those are the part-of-speech tag accuracies. spaCy's accuracy on the PTB evaluation is 92.2% --- so it makes 20% more errors than P. McP. On the other hand, spaCy is about 200x faster.
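To make the arithmetic in that comparison explicit, here is a minimal sketch that back-solves the baseline accuracy implied by the two figures quoted above (92.2% and "20% more errors"). The function name is hypothetical, and the result is only what those two numbers imply together, not a reported benchmark score.

```python
def implied_baseline_accuracy(accuracy, relative_extra_errors):
    """Return the accuracy of a baseline system, given that our system
    has `accuracy` and makes `relative_extra_errors` (e.g. 0.20 for 20%)
    more errors than the baseline."""
    error = 1.0 - accuracy
    # "20% more errors" means error == baseline_error * 1.20
    baseline_error = error / (1.0 + relative_extra_errors)
    return 1.0 - baseline_error


# Figures taken from the comment above; the output is derived, not reported.
spacy_accuracy = 0.922
implied = implied_baseline_accuracy(spacy_accuracy, 0.20)
print(f"implied baseline accuracy: {implied:.3f}")
```

Under these numbers the error rates are 7.8% versus roughly 6.5%, which is where the "20% more errors" reading comes from.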

I've been watching the line of research in SyntaxNet closely, and have been steadily working on replacing spaCy's averaged perceptron model with a neural network model. This is one of the main differences between spaCy and Parser McParseface.

The key advantage of the neural network is that it lets you take advantage of training on lots and lots more text, in a semi-supervised way. In a linear model, you grow extra parameters when you do this. The neural network stays the same size --- it just gets better. So, you can benefit from reading the whole web into the neural network. This only works a little bit in the linear model, and it makes the resulting model enormous.
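A rough way to see the size argument is to count parameters. The sketch below is an illustration only: the feature templates, layer sizes, and function names are assumptions made for the example, not spaCy's or SyntaxNet's actual models. It assumes the dense network uses a fixed-size embedding table, so unseen words share existing slots.

```python
def linear_model_size(sentences, n_classes):
    """Count weights in a sparse linear model that keeps one weight per
    (feature, class) pair for every distinct feature seen in training."""
    features = set()
    for words in sentences:
        prev = "<s>"
        for word in words:
            features.add(("word", word))             # unigram template
            features.add(("prev+word", prev, word))  # bigram template
            prev = word
    return len(features) * n_classes


def dense_model_size(vocab_size, embed_dim, hidden_dim, n_classes):
    """Count weights in a one-hidden-layer network with a fixed-size
    embedding table. The count does not depend on the training text."""
    embeddings = vocab_size * embed_dim
    hidden = embed_dim * hidden_dim + hidden_dim
    output = hidden_dim * n_classes + n_classes
    return embeddings + hidden + output


small = [["the", "cat", "sat"]]
large = small + [["a", "dog", "ran", "home"]]

# The sparse linear model grows as new words and word pairs appear.
print(linear_model_size(small, n_classes=3))
print(linear_model_size(large, n_classes=3))

# The dense model's size is set by its hyperparameters alone.
print(dense_model_size(vocab_size=1000, embed_dim=16, hidden_dim=32, n_classes=3))
```

With more text, the first count keeps climbing because every new word or word pair adds weights, while the second stays fixed and only the values of the weights change.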

Another difference is that spaCy is trained on whole documents, while P. McP. is trained in the standard set-up, using gold pre-processing. I speculate this will reduce the gap between the systems in a more realistic evaluation. Of course, P. McP. can do the joint training too if they choose to. I've reached out to see whether they're interested in running the experiment: https://github.com/tensorflow/models/issues/65

hendler 3692 days ago

Thanks for spaCy! Is that 200x with both using GPU?

syllogism 3692 days ago

spaCy doesn't use the GPU. Not sure what their speed is on GPU. I wouldn't be surprised if it's hard to use the GPU well for their parser, because minibatching gets complicated. Not sure.

hendler 3692 days ago

Also have been using spaCy with good results.

Just installed SyntaxNet; the tests passed with the following setup:

https://gist.github.com/Hendler/61831e411069815ee4ed490f553f...

INFO: Elapsed time: 908.048s, Critical Path: 640.26s

//syntaxnet:arc_standard_transitions_test PASSED in 0.0s

//syntaxnet:beam_reader_ops_test PASSED in 20.9s

//syntaxnet:graph_builder_test PASSED in 16.3s

//syntaxnet:lexicon_builder_test PASSED in 1.8s

//syntaxnet:parser_features_test PASSED in 0.0s

//syntaxnet:parser_trainer_test PASSED in 46.1s