Hacker News new | ask | show | jobs
by jweese 4783 days ago
Nice writeup. A few comments:

So you're just identifying NPs and VPs in a sentence? So lets say I run your program, and I get NPs "Instagram" and "Facebook", and the VP "acquired." The question is, who did what to whom? Did Facebook acquire Instagram, or did Instagram acquire Facebook?

Second, I think you're way over-emphasizing the supposed slowness of CFG parsing. Yes, the complexity is O(n^3) in the length of the sentence, but in practice, n is usually small. Modern statistical PCFG parsers are fast.

2 comments

My understanding is different, so please correct me / supply missing information.

From what I understand, the average length of a typical written sentence is n = ~27. OK, this is small by itself, but Stanford Parser (lexicalized PCFG) I am using needs about 1 second to parse a sentence of this size. Imagine how slow that is on a computer time-scale by comparing it to string-length operation on the same sentence.

I do encounter many sentences that are as much as 100 words long (and reading them myself, find nothing wrong with them). At about four times the length, these take about a minute to parse!

I am trying to find information about speeds of other PCFG parsers, including Collins, Charniak, Berkeley, etc. I understand dependency parsers are faster but also generally lag in accuracy.

The Stanford parser is particularly slow --- it's in java, and it's written for research more than anything. The C&C CCG parser runs at about 60-80 sentences a second, although it gives either CCG constituents or dependencies -- so the output may take some interpretation.

Shift-reduce dependency parsers are linear time, and are giving state-of-the-art results. My parser's currently a pain in the ass to install, as it hasn't really been released yet, but it does hundreds of sentences a second. Accuracy is state-of-the-art -- 92-93% depending on the beam width and the evaluation set (Stanford or MALT dependencies).

https://github.com/syllog1sm/redshift/ . You'll want the develop branch. It's GPL licensed.

It's implemented in Cython (i.e., almost all the code is Cython --- I'm not using it just for the speed critical bits), which would make it easy to work with if you're using Python. But, as I said...I don't claim it's currently fit for human consumption.

A C++ implementation of the same algorithm is here: http://www.sutd.edu.sg/yuezhang.aspx . Note his papers too -- he did some of the important work on this line of research.

The last few years of work in shift-reduce dependency parsing have been a bit of a break-through in parsing, imo.

The Stanford parser is particularly slow --- it's in java, and it's written for research more than anything.

The choice of language is not what causes the slowness of the Stanford parser. It's the choice of search strategy, which trades-off speed for accuracy.

Shift-reduce dependency parsers are linear time, and are giving state-of-the-art results.

No, this is incorrect.

The choice of parsing logic (shift-reduce, dependency) and the search strategy (greedy, sometimes erroneously called "deterministic") are orthogonal. It's the greedy search strategy that leads to linear time performance.

The choice of logic determines the lower-bound (best-case) on parsing complexity. If you do exhaustive search for the exact solution of a shift-reduce dependency parser, it is worst-case exponential. In practice, you don't do exact search, and by using a beam search approximation you can get observed linear-time performance.

[edit: You can read my thesis if you are not familiar with what a parsing logic is.]

I am not aware of state-of-the-art results from greedy shift-reduce parsers. Do you mind sharing?

> The choice of language is not what causes the slowness of the Stanford parser. It's the choice of search strategy, which trades-off speed for accuracy.

Even amongst chart parsers it's not a very quick implementation.

> The choice of parsing logic (shift-reduce, dependency) and the search strategy (greedy, sometimes erroneously called "deterministic") are orthogonal. It's the greedy search strategy that leads to linear time performance.

They're really only conceptually orthogonal, because in practice once you choose shift-reduce you're always going to choose greedy/deterministic/whatever search. I'm not aware of any shift-reduce/transition-based parsing results that don't use 1-best or beam search.

> I am not aware of state-of-the-art results from greedy shift-reduce parsers. Do you mind sharing?

Zhang and Nivre (2011) http://www.sutd.edu.sg/cmsresource/faculty/yuezhang/acl11j.p... 93.5 UAS on Stanford basic dependencies. This is the best published result on the dataset, and the best published dependency parsing result for English.

Probably the parser is still worse than the C&J reranking parser, when 200 parses are supplied to the reranker. But I'd be very surprised if it wasn't better than the Stanford parser, and probably also the Berkeley parser.

Accuracy is state-of-the-art -- 92-93% depending on the beam width and the evaluation set (Stanford or MALT dependencies).

I assume that this is for English? A former colleague of mine compared two statistical dependency parsers (Malt and MST) to a rule-based dependency parser with a maxent disambiguation model, for Dutch. The rule-based system outperforms the statistical dependency parsers by a wide margin, both in-domain and out-of-domain:

http://dl.acm.org/citation.cfm?id=1870171

Nonetheless, I find work on statistical dependency parsing to be very exiting, since it is fast and requires far less human effort :).

https://github.com/syllog1sm/redshift/ . You'll want the develop branch. It's GPL licensed.

Very nice work!

Yes, for English. There's a standard multi-lingual evaluation for statistical dependency parsers (the CoNLL 2007 data), but none for constituency parsers.

I had a look at that paper, but didn't read it carefully. All I can really say is that there's a real evaluation problem between rule-based and statistical parsers. Rule-based parsers recover richer representations, but tend to have lower coverage over arbitrary data --- they normally can't guarantee that a parse is returned; they may deem the sentence ungrammatical.

In that paper, the parsers were evaluated on "home ground" for the rule-based parser, as they used the treebank created for it. Having worked on the CCG formalism through my PhD, I can say that even small differences in annotation scheme can make a big difference in which parsers come out ahead.

Rule-based parsers recover richer representations, but tend to have lower coverage over arbitrary data

This parser usually has a coverage around ~95-97%. In one experiment we also parsed Flemish (which uses some constructions that wouldn't be considered grammatical in Dutch) and obtained a coverage of ~94%.

It also has a robustness component that attempts to provide an analysis for as many constituents as possible if no fully spanning parse can be found.

Having worked on the CCG formalism through my PhD, I can say that even small differences in annotation scheme can make a big difference in which parsers come out ahead.

Certainly. But you are not mentioning the other elephant in the room: Dutch is a free word order language and also permits very liberal ordering in the middle field. The rule-based grammar may benefit from the detailed constraints in the lexical attribute-value structures.

> Certainly. But you are not mentioning the other elephant in the room: Dutch is a free word order language and also permits very liberal ordering in the middle field. The rule-based grammar may benefit from the detailed constraints in the lexical attribute-value structures.

Mostly I didn't want to stick my neck out :p. It's easy to say something untrue about a language you don't know and haven't worked with.

I'd also be reluctant to assume which mechanisms were making a difference, because it's so hard to guess what cases are frequent and not easily inferred by a statistical model. One thing we can know is that the transition based parsers are best at producing projective dependency trees. The various techniques for non-projective shift-reduce parsing aren't very good.

Don't get me wrong --- I totally think the suggestion you offered makes sense, and it's a likely explanation. It's just that this stuff all very tricky.

I have the impression that the average sentence length is more on the order of 15 words per sentence, but I might be wrong.

The cubic time complexity is for exhaustively finding the best parse. In practice you can use various approximation techniques, such as coarse-to-fine parsing used by the Charniak & Berkeley parsers. I believe these two are faster than the Stanford Parser, and the parameters of the approximation can be tuned to have faster parsing at the expense of accuracy. You could probably get a reasonable parse tree for a sentence in one second. The fastest PCFG parser which does not do such approximations is bitpar. Another avenue is to try to use the GPU; there has been some research into this.

Dependency parsers are indeed faster but in general their accuracy cannot be compared to constituency parsers, because the information in the output is of a different nature.

I have the impression that the average sentence length is more on the order of 15 words per sentence, but I might be wrong.

It highly depends on the nature of the material. But, e.g. in Dutch based on samples of newspapers I found that the average is between 16 and 20 tokens.

The grandparent mentions sentences of 100 word (I take that he means tokens). I'd guess that such sentences usually contain one or more dependent clauses, that can be parsed separately if necessary.

In practice you can use various approximation techniques, such as coarse-to-fine parsing used by the Charniak & Berkeley parsers.

Indeed, there are many possible optimizations. Such as: restricting the number of lexical analyses using a part-of-speech tagger (particularly useful in highly lexicalized grammars), guided parsing (e.g. by filtering partial left-corner splines that never lead to a good parse), and beam search.

Dependency parsers are indeed faster

Purely statistical dependency parsers are faster. There are also dependency parsers that use (rule-based) unification grammars. Such parsers can produce dependency structure as a side-effect or even post-processing step and rely on relatively expensive unification of attribute-value structures.

>> The grandparent mentions sentences of 100 word (I take that he means tokens).

Correct. Though I have seen valid (readily human-readable) sentences even longer at 140 tokens, so the number of words too can reach or exceed 100 more frequently than commonly assumed.

>> I'd guess that such sentences usually contain one or more dependent clauses, that can be parsed separately if necessary.

Absolutely. Often more than one independent clause and several dependent clauses. But I am not aware how to identify these and parse them separately. Can you please shed some light? Are there for example some simpler grammars available that do not need to do the full parse to identify these clauses?

I haven't tried such a thing (yet), but there is some work in that area. Advaith Siddhartan's thesis may be a good start:

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.1.8...

Thanks!

Here is a study on average sentence length showing it at 24 words per sentence: http://ds.nahoo.net/Academic/Maths/Sentence.html

In my own study, I performed analysis on a corpus of a few hundred million sentences of written text and found it at 31 words per sentence.

It would be great if you can point to the use of GPU for parsing.

I had never heard of the bitpar parser, will look into it.

My average sentence length of 15 came from a bunch of literary novels. The non-literary fiction had even less, at 11 words. There is obviously quite some variance here, but actually it shouldn't matter for a parser because humans appear to parse in linear time (given a well written sentence).
Re: your first point.

In English, the first noun is almost always the actor of the verb, with subsequent nouns being acted upon. This isn't true for other languages though.

(I'm sure someone will come up with an exception to that rule but for 90% of the cases this is true).

Except in passive constructions, but maybe that's an easy case to detect.