Hacker News new | ask | show | jobs
An Efficient Way to Extract the Main Topics from a Sentence (thetokenizer.com)
232 points by shlomib 4783 days ago
15 comments

This is neat.

Shlomib, you might be interested in SHRDLU [1][2] if you are not aware of it. It was developed by Terry Winograd [3] for his dissertation [4] at MIT. It is a natural language understanding [5] parser that allows you to interact with a small world of 3D solids.

I think you will find the paper interesting, because it goes into detail on sentence structure and associated parsing.

Here is a sample dialogue with the program, taken from Wikipedia:

  Person:   Pick up a big red block.
  Computer: OK.
  Person:   Grasp the pyramid.
  Computer: I don't understand which pyramid you mean.
  Person    (changing their mind): Find a block which is taller than the one you are holding and put it into the box.
  Computer: By "it", I assume you mean the block which is taller than the one I am holding.
  Computer: OK.
  Person:   What does the box contain?
  Computer: The blue pyramid and the blue block.
  Person:   What is the pyramid supported by?
  Computer: The box.
  Person:   How many blocks are not in the box?
  Computer: Four of them.
  Person:   Is at least one of them narrower than the one which I told you to pick up?
  Computer: Yes, the red cube.
[1] https://en.wikipedia.org/wiki/SHRDLU

[2] http://hci.stanford.edu/winograd/shrdlu/

[3] https://en.wikipedia.org/wiki/Terry_Winograd

[4] http://hci.stanford.edu/winograd/shrdlu/AITR-235.pdf

[5] https://en.wikipedia.org/wiki/Natural_language_understanding

Why does this seem so scary/amazing?

It's basically taking what computers already do really well today and adding a language "mask" onto it. Yet it still blows me away.

What's really amazing is how old SHRDLU is. It was developed back in the late 60's, early 70's.

Although... I guess you could look at it two ways: Be amazed at what SHRDLU could do in 1970, or be disappointed that, given that we had that in 1970, we don't have the "Star Trek Computer" yet in 2013.

SHRDLU is definitely amazing, especially given its age, but one's amazement is tempered a little bit (or maybe enhanced, depending on perspective) when you realize that it achieved what it did primarily through really great engineering rather that some fundamental insight about language. Since SHRDLU's world is so limited, Winograd was able to explicitly program every facet of its language understanding. Unsurprisingly, this approach is totally not scalable and this reveals a little about why we don't have fully human-like language programs.
I think people underestimate how explicitly-programmed human language is in humans. I'm starting to think that this might be the central problem in NLP right now.

Humans have good natural pattern-matching engines in their heads, but the entire body of syntax and vocabulary available to a person is the result of the memorization of a huge amount of text. I suspect the majority of people rarely ever develop truly novel words or phrases on their own (with the notable exception of Lewis Carroll). (Aside: in fact, this is exactly how "memes" work in the modern online sense; one person invents a novel word or phrase, and that is then parroted by a huge number of other people.)

I recently started work on an attempt to improve the classification of English vocabulary by grade level. I built a database using publicly-available sources, and the number of unique words that the average child has been exposed to by the 8th grade is mind boggling. One source cited 15,000 unique words and over a million words read annually.

Aside from the words themselves, children have also by that age memorized an even larger number of phrases, pieces of sentence structure, and full sentences.

I think that because we aren't able to enumerate everything we've memorized, we don't fully appreciate just how much data is stored in our heads. As a result, I think it's possible that computer science researchers have largely been chasing a ghost in terms of some kind of magical "understanding" of language; the answer to NLP might actually be to simply store and access a terabytes-sized data structure of vocabulary and phrases.

The kind of "programming" that you are describing is fundamentally different than what Winograd did, and that was my point. This learning from many examples is an instance of inductive inference, and the complexity involved is why modern NLP research (and you in your project) uses machine learning techniques with massive datasets -- this more closely mimics the way we naturally acquire language. Trying to hand engineer all those rules and dependencies and exceptions is prohibitively difficult, which is why we have Siri and not SHRDLU+.
Just because we memorize a whole lot (which I agree with) does not mean that language is likely to be "pre-programmed" in the way that SHRDLU follows explicit, exhaustive rules. Formulating such rules requires planning because they are brittle, and this does not seem compatible with the way language acquisition happens.

Also, after accepting the premise that humans exploit an enormous store of data in language use, there still remain very difficult questions about what kind of representations we have available, and how powerful the search and recombination mechanisms are.

Memory-based language processing exists for some time now, and while it is useful, it is certainly not the final answer to "the central problem in NLP" (whatever you define that to be, I'd suggest ambiguity resolution).

the answer to NLP might actually be to simply store and access a terabytes-sized data structure of vocabulary and phrases.

Isn't that effectively what google translate is doing? And it's results are... varied.

Based on my experience, the hilarious thing about NLP is that it is easy for humans to generate easy to parse sentences like "Facebook acquires Instagram.", but if you are trying to parse a naturally flowing conversion, you rarely get easy examples like that. There is so much context in our conversations.
>> I recently started work on an attempt to improve the classification of English vocabulary by grade level.

I would be interested in this. Let me know if you plan to open this. What data sources are you using?

Since SHRDLU's world is so limited, Winograd was able to explicitly program every facet of its language understanding. Unsurprisingly, this approach is totally not scalable and this reveals a little about why we don't have fully human-like language programs.

That's a good point. It does lead one to wonder, however, if techniques inspired to SHRDLU could (or do) have application in domain-specific applications where the world is likewise restricted. Given the increases in raw horsepower available since SHURDLU was first developed, I find myself wondering if we couldn't do some pretty useful things today, using this approach.

Yes. For example, consider interlingual machine translation. Most systems today (like Google) use statistical MT that learns patterns from millions of examples. In interlingua, by contrast, you analyze the input sentence to form a language-independent representation of the sentence's meaning. Then you use that representation to generate a sentence in a new language.

As you might expect, this is basically impossible for wide-domain MT because we don't have unambiguous representations of the meaning of every sentence, and we don't necessarily know how to combine them, and there's a lot of non-compositional phrases, and on and on.

However, if we restrict ourselves to one small domain, interlingua can work. For example, the KANT system [1] is an interlingua that is built for translating technical manuals for Caterpillar products (bulldozers and so on). The input has to be written in a restricted subset of English (Caterpillar Technical English), but then you can analyze it exactly with hand-written rules, and produce exact output in the target language.

[1] http://www2.lti.cs.cmu.edu/Research/Kant/

Firstly, we have done similar things. For example, we have/had http://en.wikipedia.org/wiki/METEO_System for weather reports (use "machine translation weather reports" to google Scientific literature. Among others, that finds information that work is being done on a Croatian version of this). I think there have been successes in the medical field, too, but cannot find them.

However, this 'knowledge engineering' approach to AI has fallen somewhat out of fashion a bit in favourite of statistical methods (however, I don't think anybody does statistics 'from scratch'. For example, in NLP, you could try to statistically learn the definite articles in English, but hard-coding that 'the' is the only one will get you results faster.

Woah, that's eliza for CAD in steroids.
Cool idea, but basically reinvents chunking, which NLTK already has (http://nltk.org/api/nltk.chunk.html). Keep up the writing and NLP research though :)
Nice writeup. A few comments:

So you're just identifying NPs and VPs in a sentence? So lets say I run your program, and I get NPs "Instagram" and "Facebook", and the VP "acquired." The question is, who did what to whom? Did Facebook acquire Instagram, or did Instagram acquire Facebook?

Second, I think you're way over-emphasizing the supposed slowness of CFG parsing. Yes, the complexity is O(n^3) in the length of the sentence, but in practice, n is usually small. Modern statistical PCFG parsers are fast.

My understanding is different, so please correct me / supply missing information.

From what I understand, the average length of a typical written sentence is n = ~27. OK, this is small by itself, but Stanford Parser (lexicalized PCFG) I am using needs about 1 second to parse a sentence of this size. Imagine how slow that is on a computer time-scale by comparing it to string-length operation on the same sentence.

I do encounter many sentences that are as much as 100 words long (and reading them myself, find nothing wrong with them). At about four times the length, these take about a minute to parse!

I am trying to find information about speeds of other PCFG parsers, including Collins, Charniak, Berkeley, etc. I understand dependency parsers are faster but also generally lag in accuracy.

The Stanford parser is particularly slow --- it's in java, and it's written for research more than anything. The C&C CCG parser runs at about 60-80 sentences a second, although it gives either CCG constituents or dependencies -- so the output may take some interpretation.

Shift-reduce dependency parsers are linear time, and are giving state-of-the-art results. My parser's currently a pain in the ass to install, as it hasn't really been released yet, but it does hundreds of sentences a second. Accuracy is state-of-the-art -- 92-93% depending on the beam width and the evaluation set (Stanford or MALT dependencies).

https://github.com/syllog1sm/redshift/ . You'll want the develop branch. It's GPL licensed.

It's implemented in Cython (i.e., almost all the code is Cython --- I'm not using it just for the speed critical bits), which would make it easy to work with if you're using Python. But, as I said...I don't claim it's currently fit for human consumption.

A C++ implementation of the same algorithm is here: http://www.sutd.edu.sg/yuezhang.aspx . Note his papers too -- he did some of the important work on this line of research.

The last few years of work in shift-reduce dependency parsing have been a bit of a break-through in parsing, imo.

The Stanford parser is particularly slow --- it's in java, and it's written for research more than anything.

The choice of language is not what causes the slowness of the Stanford parser. It's the choice of search strategy, which trades-off speed for accuracy.

Shift-reduce dependency parsers are linear time, and are giving state-of-the-art results.

No, this is incorrect.

The choice of parsing logic (shift-reduce, dependency) and the search strategy (greedy, sometimes erroneously called "deterministic") are orthogonal. It's the greedy search strategy that leads to linear time performance.

The choice of logic determines the lower-bound (best-case) on parsing complexity. If you do exhaustive search for the exact solution of a shift-reduce dependency parser, it is worst-case exponential. In practice, you don't do exact search, and by using a beam search approximation you can get observed linear-time performance.

[edit: You can read my thesis if you are not familiar with what a parsing logic is.]

I am not aware of state-of-the-art results from greedy shift-reduce parsers. Do you mind sharing?

> The choice of language is not what causes the slowness of the Stanford parser. It's the choice of search strategy, which trades-off speed for accuracy.

Even amongst chart parsers it's not a very quick implementation.

> The choice of parsing logic (shift-reduce, dependency) and the search strategy (greedy, sometimes erroneously called "deterministic") are orthogonal. It's the greedy search strategy that leads to linear time performance.

They're really only conceptually orthogonal, because in practice once you choose shift-reduce you're always going to choose greedy/deterministic/whatever search. I'm not aware of any shift-reduce/transition-based parsing results that don't use 1-best or beam search.

> I am not aware of state-of-the-art results from greedy shift-reduce parsers. Do you mind sharing?

Zhang and Nivre (2011) http://www.sutd.edu.sg/cmsresource/faculty/yuezhang/acl11j.p... 93.5 UAS on Stanford basic dependencies. This is the best published result on the dataset, and the best published dependency parsing result for English.

Probably the parser is still worse than the C&J reranking parser, when 200 parses are supplied to the reranker. But I'd be very surprised if it wasn't better than the Stanford parser, and probably also the Berkeley parser.

Accuracy is state-of-the-art -- 92-93% depending on the beam width and the evaluation set (Stanford or MALT dependencies).

I assume that this is for English? A former colleague of mine compared two statistical dependency parsers (Malt and MST) to a rule-based dependency parser with a maxent disambiguation model, for Dutch. The rule-based system outperforms the statistical dependency parsers by a wide margin, both in-domain and out-of-domain:

http://dl.acm.org/citation.cfm?id=1870171

Nonetheless, I find work on statistical dependency parsing to be very exiting, since it is fast and requires far less human effort :).

https://github.com/syllog1sm/redshift/ . You'll want the develop branch. It's GPL licensed.

Very nice work!

Yes, for English. There's a standard multi-lingual evaluation for statistical dependency parsers (the CoNLL 2007 data), but none for constituency parsers.

I had a look at that paper, but didn't read it carefully. All I can really say is that there's a real evaluation problem between rule-based and statistical parsers. Rule-based parsers recover richer representations, but tend to have lower coverage over arbitrary data --- they normally can't guarantee that a parse is returned; they may deem the sentence ungrammatical.

In that paper, the parsers were evaluated on "home ground" for the rule-based parser, as they used the treebank created for it. Having worked on the CCG formalism through my PhD, I can say that even small differences in annotation scheme can make a big difference in which parsers come out ahead.

Rule-based parsers recover richer representations, but tend to have lower coverage over arbitrary data

This parser usually has a coverage around ~95-97%. In one experiment we also parsed Flemish (which uses some constructions that wouldn't be considered grammatical in Dutch) and obtained a coverage of ~94%.

It also has a robustness component that attempts to provide an analysis for as many constituents as possible if no fully spanning parse can be found.

Having worked on the CCG formalism through my PhD, I can say that even small differences in annotation scheme can make a big difference in which parsers come out ahead.

Certainly. But you are not mentioning the other elephant in the room: Dutch is a free word order language and also permits very liberal ordering in the middle field. The rule-based grammar may benefit from the detailed constraints in the lexical attribute-value structures.

I have the impression that the average sentence length is more on the order of 15 words per sentence, but I might be wrong.

The cubic time complexity is for exhaustively finding the best parse. In practice you can use various approximation techniques, such as coarse-to-fine parsing used by the Charniak & Berkeley parsers. I believe these two are faster than the Stanford Parser, and the parameters of the approximation can be tuned to have faster parsing at the expense of accuracy. You could probably get a reasonable parse tree for a sentence in one second. The fastest PCFG parser which does not do such approximations is bitpar. Another avenue is to try to use the GPU; there has been some research into this.

Dependency parsers are indeed faster but in general their accuracy cannot be compared to constituency parsers, because the information in the output is of a different nature.

I have the impression that the average sentence length is more on the order of 15 words per sentence, but I might be wrong.

It highly depends on the nature of the material. But, e.g. in Dutch based on samples of newspapers I found that the average is between 16 and 20 tokens.

The grandparent mentions sentences of 100 word (I take that he means tokens). I'd guess that such sentences usually contain one or more dependent clauses, that can be parsed separately if necessary.

In practice you can use various approximation techniques, such as coarse-to-fine parsing used by the Charniak & Berkeley parsers.

Indeed, there are many possible optimizations. Such as: restricting the number of lexical analyses using a part-of-speech tagger (particularly useful in highly lexicalized grammars), guided parsing (e.g. by filtering partial left-corner splines that never lead to a good parse), and beam search.

Dependency parsers are indeed faster

Purely statistical dependency parsers are faster. There are also dependency parsers that use (rule-based) unification grammars. Such parsers can produce dependency structure as a side-effect or even post-processing step and rely on relatively expensive unification of attribute-value structures.

>> The grandparent mentions sentences of 100 word (I take that he means tokens).

Correct. Though I have seen valid (readily human-readable) sentences even longer at 140 tokens, so the number of words too can reach or exceed 100 more frequently than commonly assumed.

>> I'd guess that such sentences usually contain one or more dependent clauses, that can be parsed separately if necessary.

Absolutely. Often more than one independent clause and several dependent clauses. But I am not aware how to identify these and parse them separately. Can you please shed some light? Are there for example some simpler grammars available that do not need to do the full parse to identify these clauses?

I haven't tried such a thing (yet), but there is some work in that area. Advaith Siddhartan's thesis may be a good start:

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.1.8...

Thanks!

Here is a study on average sentence length showing it at 24 words per sentence: http://ds.nahoo.net/Academic/Maths/Sentence.html

In my own study, I performed analysis on a corpus of a few hundred million sentences of written text and found it at 31 words per sentence.

It would be great if you can point to the use of GPU for parsing.

I had never heard of the bitpar parser, will look into it.

My average sentence length of 15 came from a bunch of literary novels. The non-literary fiction had even less, at 11 words. There is obviously quite some variance here, but actually it shouldn't matter for a parser because humans appear to parse in linear time (given a well written sentence).
Re: your first point.

In English, the first noun is almost always the actor of the verb, with subsequent nouns being acted upon. This isn't true for other languages though.

(I'm sure someone will come up with an exception to that rule but for 90% of the cases this is true).

Except in passive constructions, but maybe that's an easy case to detect.
The sentence subject is one thing, the sentence <i>topic</i> might be quite another.

Consider sentences like: "He joined the not-yet-famous Liverpool band in early 1958."

To many human beings the topic is quickly obvious. Parsing won't do the trick.

The only reason that sentence is "obvious" to many people is because we have a reference to a famous band from Liverpool that got its start in the late 50's/early 60's that is already embedded in our brain's library of facts.

Removed from that context human beings see that sentence as equally meaningless as a parser, because it is. I'd imagine many young people (who don't have the "correct" reference points) wouldn't have a clue that the sentence is about George Harrison.

In order to properly handle this sentence one would need the same external reference that your brain has. Without the reference the sentence can be discarded as incomplete, since that's what a human would do too.

You don't necessarily need knowledge to distinguish a topic from a subject though. It's a grammatical distinction. See

http://en.wikipedia.org/wiki/Topic%E2%80%93comment#Definitio... e.g. in the sentence (3) As for the little girl, the dog bit her "the dog" is the subject NP, "the little girl" is the topic. That toy example is perfectly parsable without semantics or even probabilities (though take any real-world sentence and I'm betting you'll need more than just grammar).

Thank you for elaborating my point. But a machine don't need an enormous library of facts to find the topic, just a small one. The words 'he', 'band', 'Liverpool' and '1958' are enough to get very precise ... but not because of their grammatical position.
This is why systems such as MIT's ConceptNet exist. NLTK and other NLP systems can provide the grammatical breakdown and ConceptNet can fill in the semantic / contextual gaps.
Ask anyone under ~30, they won't know what you're talking about.
OK just for you: "His 4th album was the second best-selling of the 2000s." The subject of the sentence is "his".

Point remains. Without a context you're lost.

My existence proves this sentence wrong.
Googling will though.
This is neat!

The article gives an example which I find a bit confusing.

>I ran it on this sentence -

> “Swayy is a beautiful new dashboard for discovering and curating online content.”

>And got this result -

> This sentence is about: Swayy, beautiful new dashboard, online content

That misses "discovering" and "curating", which I think are the most important parts of that sentence.

Nah, it filtered out the meaningless buzzwords.

It's a dashboard for online content. That pretty much implies the picking and finding of said online content to be displayed on the dashboard.

Huh? "Curating" and "discovering" may be overused tech verbs, but they are vital in describing what the "dashboard" does. For example, you would never describe the Google Analytics dashboard as something that curates or discovers.

And far worse than buzzwords are adjectives. Does "beautiful" add anything to that sentence?

I disagree. Dashboards curate by definition. Some car dashboards display have a tachometer, some don't it depends on the car's focus. Google Analytics similarly doesn't display all available information, just that which will be useful. Dashboards always display a limited subset of available information that makes sense to the current context aka curated.

Discovering is also vacuous. The whole point of a dashboard is to convey information. Do I discover how fast I am going by looking at my car's dashboard? Do I discover my website's traffic by going to Google analytics? Sure, I wouldn't use them if I didn't get the information I need from them. So using an online content dashboard that doesn't deliver online content of some sort would be a waste of time.

Beautiful adds something because not all dashboards are beautiful.

Google Analytics has nothing whatsoever to do with online content.
You're missing the point. The OP is talking about a system for interpreting sentences in bulk and extracting useful keywords. "beautiful new" are not useful, and arguably, "dashboard" is not particularly useful. "Curating" and "discovering", while grating to our ears, are definitely descriptive words of purpose...because there are "dashboards" that have nothing to do with "curating"...so ostensibly, "curating" has some use as a keyword
This is because he is only extracting the noun phrases from the sentence. If you adapted his code to tag verb phrases as well (by modifying the semi-CFG and the normalize_tags method) then you could also extract "discovering" and "curating" as well.
But this would miss the "main topics," since when you have both the vp's and the np's you have everything :/ Here is the resulting tree (it's unformatted, sorry, I tweaked an old Prolog grammar I had for analysing search keywords and tweets):

[[[[Swayy,snp],np],[is,[a,[[beautiful,new,dashboard,snp],np],np_],vp],simple_s],for,[[discovering,simple_s],and,[[curating,[[online,content,snp],np],vp],simple_s],s],s]

Or you could just drop stop words and gerunds. This is another post from "TheTokenizer" that over simplifies a complex problem and creates devistatingly bad results.

The method described doesn't tell you what the sentence is about it tells you which things aren't verbs and articles and does a poor job of it.

Granted single sentence keyword extraction is not easy, but this is an awful approach. You'd be much better using Word Frequency analysis to determine the rarest words in the sentence.

Just apply TFIDF to text and it extracts the most interesting words out of the phrase - it's dead simple. You just count words and do a little scoring and sorting. Example applied to tweets. Check out how the least significant words come out last. Some words have been dropped (those with frequency less than 5 in a corpus of a few million phrases).

------

- "math final today 6-17-09 piece of cake hopefully i should do well since i m a math nerd amp english amp social"

- math, nerd, studies, piece, cake, english, amp, final, hopefully, social, since, should, well, today, do, of,

------

- "anyone want an incredibly designed unique limited edition tee for the summer check out www artcotic com"

- tee, designed, incredibly, unique, edition, limited, summer, anyone, check, www, want, com, an, out, for, the,

As a working computational linguist I shiver anytime human language technology is discussed on HN. Smart developers (who don't work on human language) are shockingly ignorant about natural language processing. For instance, this article reinvents "chunking". People who are interested in these problems are advised to read the entire NLTK book and Jurafsky & Martin textbook before reinventing square wheels. </$.02>
For some NLP, I really suggest using OpenNLP (http://opennlp.apache.org/) from Apache. It has libraries that can be trained to do different NLP tasks like sentence splitting, tokenizers, POS tagging, and document classification. I still didn't manage to use all of them but in my experience, it's very easy to use. Documentation is good too!
I'm a fan of OpenNLP as well, although I haven't done a lot of performance evaluation around it yet. Apache Stanbol[1] is also a very interesting project, which leverages OpenNLP (among other things) for doing semantic entity extraction from text.

Also, FWIW, I wrote an article[2] a while back, focusing on Open Source NLP tools. It was aimed slightly more at business users than developers, so it doesn't dig real deep on the tech side, but there is a list of popular OSS NLP tools that people interested in this topic might find useful.

And if I can throw in another shameless plug (only because I think it will genuinely be of interest, of course), I'll point out this post[3] on Prolog resources, since Prolog often finds application in the NLP world.

[1]: http://stanbol.apache.org

[2]: http://osintegrators.com/opensoftwareintegrators|howyoucanbe...

[3]: http://fogbeam.blogspot.com/2013/05/prolog-im-going-to-learn...

And if I can throw in another shameless plug (only because I think it will genuinely be of interest, of course), I'll point out this post[3] on Prolog resources, since Prolog often find application in the NLP world

You missed the nicest and most satisfying book ;):

http://www.mtome.com/Publications/PNLA/pnla-digital.html

It is simultaneously an introduction to Prolog and natural language parsing using Prolog.

Very cool. That post was originally written quite some time ago, and it was never meant to be an exhaustive list. That said, I'll add this to the list as well. Thanks for the pointer!
Good job. But I need to make the note that the title is confusing to NLP/ML practitioners. 'topics' usually refer to clusters as captured by so-called 'topic models' [1], the output of an unsupervised learning method, usually a variant of LDA. [1] http://en.wikipedia.org/wiki/Topic_model
I was interested in this to parse some user entered data extracted from Facebook, and found Text Razor[1] to be pretty good at this.

Natural Language is a beautiful thing.

[1] http://www.textrazor.com/

Without having Brown and NLTK in Node.js, I'm not sure how well I can add this to my port of shlomibs original code. For those who haven't seen yet, I wrote a port of the first part of this here https://github.com/jbrooksuk/node-summary

Maybe later I'll give it a crack :)

Heh, I raised an issue on my GitHub page https://github.com/jbrooksuk/node-summary/issues/2 and found natural shortly after.
Wordnet & the Brill Pos tagged corpus do the trick, more or less
https://github.com/taf2/rb-brill-tagger. For anyone usin ruby can do something very similar much of the code is c++ with smaller ruby API ... It's pretty good bug reports welcome...
What happens when you apply it repeatedly? Does the text keep shrinking?
I smell a $30 million acquisition in the near future..
By those standards, many companies will be worth billions ;).

(Chunking is not exactly new and PCFG parsing is pretty fast these days.)

I think this was a sarcastic commentary on the Summly acquisition.
bingo :)