Hacker News
by xigency 3692 days ago
Evidence that this is the most accurate parser is here; the previous approach mentioned is a March 2016 paper, "Globally Normalized Transition-Based Neural Networks," http://arxiv.org/abs/1603.06042

"On a standard benchmark consisting of randomly drawn English newswire sentences (the 20 year old Penn Treebank), Parsey McParseface recovers individual dependencies between words with over 94% accuracy, beating our own previous state-of-the-art results, which were already better than any previous approach."

From the original paper, "Our model achieves state-of-the-art accuracy on all of these tasks, matching or outperforming LSTMs while being significantly faster. In particular for dependency parsing on the Wall Street Journal we achieve the best-ever published unlabeled attachment score of 94.41%."
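
For readers unfamiliar with the metric: unlabeled attachment score (UAS) is the fraction of tokens whose predicted head matches the gold-standard head, ignoring the dependency label. A minimal sketch, assuming a made-up five-token sentence; the function name and head indices are illustrative and not taken from the paper or from SyntaxNet:

```python
def unlabeled_attachment_score(gold_heads, predicted_heads):
    """Fraction of tokens whose predicted head index equals the gold head index."""
    if len(gold_heads) != len(predicted_heads):
        raise ValueError("gold and predicted must cover the same tokens")
    if not gold_heads:
        raise ValueError("need at least one token")
    correct = sum(g == p for g, p in zip(gold_heads, predicted_heads))
    return correct / len(gold_heads)


# Hypothetical example: one head index per token, 0 = root.
gold = [2, 0, 2, 5, 3]
pred = [2, 0, 2, 3, 3]  # the fourth token is attached to the wrong head
uas = unlabeled_attachment_score(gold, pred)
print(f"UAS = {uas:.2%}")
```

Reported scores such as 94.41% are this ratio computed over every token in the evaluation set rather than over a single sentence.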

This seems like a narrower standard than described, specifically being better at parsing the Penn Treebank than the best natural language parser for English on the Wall Street Journal.

The statistics listed on the project GitHub actually contradict these claims by showing the original March 2016 implementation has higher accuracy than Parsey McParseface.


spaCy is another active open source (MIT) POS-tagger. In a previous discussion on HN[1] it was well received.

There is a simplified educational 200 lines python version [2] of it. It claims 96.8% for the WSJ corpus.

What am I missing here?

[1] https://news.ycombinator.com/item?id=8942783

[2] https://spacy.io/blog/part-of-speech-pos-tagger-in-python

Those are the part-of-speech tag accuracies. spaCy's accuracy on the PTB evaluation is 92.2% --- so it makes 20% more errors than P. McP. On the other hand, spaCy is about 200x faster.
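
Comparisons like this are made on error rates rather than on accuracies. A quick sketch of the arithmetic, assuming the 92.2% figure above and a round 94.0% for Parsey McParseface (the announcement only says "over 94%", so the exact ratio depends on which score is plugged in); the function name is made up for illustration:

```python
def relative_extra_errors(acc_a, acc_b):
    """Extra errors made by system A, as a fraction of system B's errors.

    Both arguments are accuracies in percent.
    """
    err_a = 100.0 - acc_a
    err_b = 100.0 - acc_b
    return err_a / err_b - 1.0


# Assumed inputs: 92.2 from the comment above, 94.0 as a round stand-in.
extra = relative_extra_errors(92.2, 94.0)
print(f"{extra:.0%} more errors")
```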

I've been watching the line of research in SyntaxNet closely, and have been steadily working on replacing spaCy's averaged perceptron model with a neural network model. This is one of the main differences between spaCy and Parsey McParseface.

The key advantage of the neural network is that it lets you take advantage of training on lots and lots more text, in a semi-supervised way. In a linear model, you grow extra parameters when you do this. The neural network stays the same size --- it just gets better. So, you can benefit from reading the whole web into the neural network. This only works a little bit in the linear model, and it makes the resulting model enormous.
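
A toy illustration of the size argument, not spaCy's or SyntaxNet's actual feature set: a sparse linear model keeps one weight per distinct feature it has seen, so its parameter count grows with the text, while a dense network's parameter count is fixed by its layer sizes. The feature template, the fixed vocabulary size, and the layer dimensions below are all assumptions chosen for the sketch.

```python
def linear_model_size(sentences):
    """Count distinct sparse features (word and word-pair indicators) in the text."""
    features = set()
    for sentence in sentences:
        words = sentence.split()
        for i, word in enumerate(words):
            features.add(("w", word))
            if i > 0:
                features.add(("w-1,w", words[i - 1], word))
    return len(features)


def network_size(vocab_size, embed_dim, hidden_dim, n_actions):
    """Parameter count of a one-hidden-layer network over a fixed vocabulary.

    Assumption: words outside the vocabulary share one unknown-word embedding,
    so reading more text does not add parameters.
    """
    embeddings = vocab_size * embed_dim
    hidden = embed_dim * hidden_dim + hidden_dim
    output = hidden_dim * n_actions + n_actions
    return embeddings + hidden + output


small = ["the cat sat", "the dog sat"]
large = small + ["a bird flew away", "the cat flew home"]

# The sparse model needs more weights as soon as it sees new words and pairs.
print(linear_model_size(small), linear_model_size(large))
# The network's size depends only on the chosen dimensions.
print(network_size(vocab_size=10_000, embed_dim=64, hidden_dim=128, n_actions=3))
```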

Another difference is that spaCy is trained on whole documents, while P. McP. is trained in the standard set-up, using gold pre-processing. I speculate this will reduce the gap between the systems in a more realistic evaluation. Of course, P. McP. can do the joint training too if they choose to. I've reached out to see whether they're interested in running the experiment: https://github.com/tensorflow/models/issues/65

Thanks for spaCy! Is that 200x with both using GPU?
spaCy doesn't use the GPU. Not sure what their speed is on GPU. I wouldn't be surprised if it's hard to use the GPU well for their parser, because minibatching gets complicated. Not sure.
Also have been using spaCy with good results.

Just installed syntaxnet - tests passed in the following setup.

https://gist.github.com/Hendler/61831e411069815ee4ed490f553f...

INFO: Elapsed time: 908.048s, Critical Path: 640.26s

//syntaxnet:arc_standard_transitions_test PASSED in 0.0s

//syntaxnet:beam_reader_ops_test PASSED in 20.9s

//syntaxnet:graph_builder_test PASSED in 16.3s

//syntaxnet:lexicon_builder_test PASSED in 1.8s

//syntaxnet:parser_features_test PASSED in 0.0s

//syntaxnet:parser_trainer_test PASSED in 46.1s

//syntaxnet:reader_ops_test PASSED in 5.7s

//syntaxnet:sentence_features_test PASSED in 0.0s

//syntaxnet:shared_store_test PASSED in 0.5s

//syntaxnet:tagger_transitions_test PASSED in 0.0s

//syntaxnet:text_formats_test PASSED in 1.7s

//util/utf8:unicodetext_unittest PASSED in 0.0s

Some other notes:

Also using Keras with Theano. Before spaCy, StanfordNLP, Freeling, and/or NLTK.

spaCy's 96.8% accuracy is for the task of POS tagging while Google's reported 94% accuracy is for dependency parsing, a significantly harder problem.
Thanks (also to charlieegan3) this explains the difference nicely.

Actually I found an educational 500 lines python parser from the same author as well [1]. It claims an accuracy of 92.7 for the WSJ corpus.

[1] https://spacy.io/blog/parsing-english-in-python

Well, it seems this page could use some work: http://www.aclweb.org/aclwiki/index.php?title=Dependency_Par...
spaCy also has a dependency parser - looks like this blog post is just talking about the POS-tagger.

    better at parsing the Penn Treebank than the best
    natural language parser for English on the Wall
    Street Journal
I'm pretty sure "the 20 year old Penn Treebank" and "the Wall Street Journal" are referring to the same dataset here. In the early 1990s the first large treebanking efforts were on a corpus from the WSJ, and they were released as the Penn Treebank: https://catalog.ldc.upenn.edu/LDC95T7 People report results on this dataset because that's what the field has been testing on (and overfitting to) for decades.

(I worked on a successor project, OntoNotes, that involved additional treebank annotation on broader corpora: https://catalog.ldc.upenn.edu/LDC2013T19)

Yes, the press release is (actually) pretty difficult to parse and really opaque in how the comparison is measured, which is why I wanted to throw into question the blog's headline, "The World's Most Accurate Parser." It seems more clear now but obviously Google doesn't feel the need to overtly prove that they are the best in the world at tasks, which is a bit questionable considering their number of followers. In all, it seems they have tested against several other dependency parsers, but clearly not all of them, and it's fair to say that it is "highly accurate," but this parser still falls victim to some of the same issues that most statistical parsers do, and while faster than some dependency parsers, it is not faster than all of them.

The point about overfitting is valid, too, which is another reason why this "most accurate such model in the world" claim is obnoxious.

It's also fair to note that their advance is in fractions of percentage points on this specific dataset over models that are 5-10 years older.

> The statistics listed on the project GitHub actually contradict these claims by showing the original March 2016 implementation has higher accuracy than Parsey McParseface.

So you're referring to this LSTM?

"Andor et al. (2016)* is simply a SyntaxNet model with a larger beam and network. For futher information on the datasets, see that paper under the section "Treebank Union"."

After spending a few months hand coding an NLP parser, am rather intrigued by LSTM. I like the idea of finding coefficients, as opposed to juggling artificial labels.

Yes, my mistake. Their claim is that SyntaxNet (originally described in the paper and improved over one month) is the best in field, whereas Parsey McParseface is just one trained instance.
> SyntaxNet improved over one month ... whereas Parsey McParseface is just one trained instance

Cool. I wonder how much human effort (vs. machine time) went into tweaking SyntaxNet versus tweaking Parsey M.?

Coincidentally, I had a parent/teacher conference with my 1st grader's teacher yesterday afternoon. Regarding reading level & comprehension, she remarked that current research indicates anything below about 98% comprehension isn't sufficient for reading "fluency". Before the past few years, the standard was 95% comprehension = fluency, but that extra few percentage points apparently make an enormous difference (probably because of colloquial & jargon edge case usages that carry specific meanings in specific contexts, but which aren't easy to programmatically detect, but that's just my supposition).
Sorry, but that just doesn't make any sense to me. Practically 70% seems like enough to understand most narrative. I've read some really difficult texts (translated German theology) and for anything of meaningful complexity 98% is unreachable without a huge vocabulary and understanding of both oddities of grammar and the construction of narrative or argument.
The paper you mention reports the world's best results; its model is Parsey McParseface with a broader beam search and more hidden layers.
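
For context on the "broader beam search" mentioned above: beam search keeps only the k highest-scoring partial action sequences at each step, so a larger k explores more candidate parses at a higher cost. A generic sketch with a hypothetical scoring function; it is not SyntaxNet's implementation, and the action names are placeholders:

```python
import heapq


def beam_search(actions, score_fn, n_steps, beam_size):
    """Return (score, sequence) for the best length-n_steps sequence found."""
    beam = [(0.0, ())]  # (cumulative score, action sequence so far)
    for _ in range(n_steps):
        candidates = [
            (score + score_fn(seq, action), seq + (action,))
            for score, seq in beam
            for action in actions
        ]
        # Keep only the beam_size highest-scoring partial sequences.
        beam = heapq.nlargest(beam_size, candidates, key=lambda c: c[0])
    return max(beam, key=lambda c: c[0])


def toy_score(seq, action):
    """Hypothetical scorer: SHIFT looks best at first, but LEFT then RIGHT pays off."""
    if not seq:
        return {"SHIFT": 1.0, "LEFT": 0.5, "RIGHT": 0.0}[action]
    if seq[-1] == "LEFT" and action == "RIGHT":
        return 2.0
    return 0.1


ACTIONS = ("SHIFT", "LEFT", "RIGHT")
greedy = beam_search(ACTIONS, toy_score, n_steps=2, beam_size=1)
wider = beam_search(ACTIONS, toy_score, n_steps=2, beam_size=2)
print(greedy)
print(wider)
```

With a beam of 1 the search commits to the locally best first action and never sees the better two-step sequence; with a beam of 2 the runner-up survives long enough to win.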

This is an open-sourcing of the March 2016 method (SyntaxNet; note that the paper reports results from several trained models) as well as a trained model that is comparable in performance but faster (Parsey McParseface).

It is very hard to separate those two things from the way they write.