Hacker News new | ask | show | jobs
by andreasvc 4783 days ago
I have the impression that the average sentence length is more on the order of 15 words per sentence, but I might be wrong.

The cubic time complexity is for exhaustively finding the best parse. In practice you can use various approximation techniques, such as coarse-to-fine parsing used by the Charniak & Berkeley parsers. I believe these two are faster than the Stanford Parser, and the parameters of the approximation can be tuned to have faster parsing at the expense of accuracy. You could probably get a reasonable parse tree for a sentence in one second. The fastest PCFG parser which does not do such approximations is bitpar. Another avenue is to try to use the GPU; there has been some research into this.

Dependency parsers are indeed faster but in general their accuracy cannot be compared to constituency parsers, because the information in the output is of a different nature.

2 comments

I have the impression that the average sentence length is more on the order of 15 words per sentence, but I might be wrong.

It highly depends on the nature of the material. But, e.g. in Dutch based on samples of newspapers I found that the average is between 16 and 20 tokens.

The grandparent mentions sentences of 100 word (I take that he means tokens). I'd guess that such sentences usually contain one or more dependent clauses, that can be parsed separately if necessary.

In practice you can use various approximation techniques, such as coarse-to-fine parsing used by the Charniak & Berkeley parsers.

Indeed, there are many possible optimizations. Such as: restricting the number of lexical analyses using a part-of-speech tagger (particularly useful in highly lexicalized grammars), guided parsing (e.g. by filtering partial left-corner splines that never lead to a good parse), and beam search.

Dependency parsers are indeed faster

Purely statistical dependency parsers are faster. There are also dependency parsers that use (rule-based) unification grammars. Such parsers can produce dependency structure as a side-effect or even post-processing step and rely on relatively expensive unification of attribute-value structures.

>> The grandparent mentions sentences of 100 word (I take that he means tokens).

Correct. Though I have seen valid (readily human-readable) sentences even longer at 140 tokens, so the number of words too can reach or exceed 100 more frequently than commonly assumed.

>> I'd guess that such sentences usually contain one or more dependent clauses, that can be parsed separately if necessary.

Absolutely. Often more than one independent clause and several dependent clauses. But I am not aware how to identify these and parse them separately. Can you please shed some light? Are there for example some simpler grammars available that do not need to do the full parse to identify these clauses?

I haven't tried such a thing (yet), but there is some work in that area. Advaith Siddhartan's thesis may be a good start:

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.1.8...

Thanks!

Here is a study on average sentence length showing it at 24 words per sentence: http://ds.nahoo.net/Academic/Maths/Sentence.html

In my own study, I performed analysis on a corpus of a few hundred million sentences of written text and found it at 31 words per sentence.

It would be great if you can point to the use of GPU for parsing.

I had never heard of the bitpar parser, will look into it.

My average sentence length of 15 came from a bunch of literary novels. The non-literary fiction had even less, at 11 words. There is obviously quite some variance here, but actually it shouldn't matter for a parser because humans appear to parse in linear time (given a well written sentence).