Hacker News
by teraflop 3692 days ago
This is really cool, and props to Google for making it publicly available.

The blog post says this can be used as a building block for natural language understanding applications. Does anyone have examples of how that might work? Parse trees are cool to look at, but what can I do with them?

For instance, let's say I'm interested in doing text classification. I can imagine that the parse tree would convey more semantic information than just a bag of words. Should I be turning the edges and vertices of the tree into feature vectors somehow? I can think of a few half-baked ideas off the top of my head, but I'm sure other people have already spent a lot of time thinking about this, and I'm wondering if there are any "best practices".

7 comments

This would be very interesting when applied to Biblical Studies. Any serious academic discussion of biblical texts will involve syntactical breakdown of the text being discussed. Most of the time the ambiguities are clear, but it's still quite common for a phrase to have several possible syntactical arrangements that are not immediately clear. These ambiguities are also challenging because the languages are dead (at least as used in the biblical texts). So the type of ambiguity of "Alice drove down the street in her car" can lead to some significant scholarly disagreement.

I could see Parsey McParseface helping to identify patterns in literature contemporaneous to the biblical texts. Certain idiomatic uses of syntax, which would have been obvious to the original readers, could be identified much more quickly.

I was going to say... my main interest in this project is precisely for Biblical studies... I could talk about analyzing the Bible for hours, but let's just say there's way more depth than many even realize. The Aleph Tav in relation to the Book of Revelation is one such example; many translations omit it, but the Aleph Tav Study Bible explores it in depth. There could be many discoveries made with these kinds of projects that are missed by just about anyone only reading a translation.

There are a ton of Jewish Idioms in the Bible that many don't understand at all, including "No man knows the day or the hour" which is a traditional Jewish Wedding Idiom. Lots and lots of things could be explored with enough data and resources.

I'd think that the advantage of machine translation is on corpora that are not known up front (i.e. user-supplied text) or corpora that are exceptionally large.

If you have a small (ish), well-known text, I don't think you will get much insight from machine translation. Certainly there are plenty of uses for computer text analysis/mining in biblical studies, but I doubt translation is one of them. And for obscure idioms or hapax legomena, machine translation definitely can't help you because by definition there are no other sources to rely on.

With a sufficient level of precision, there's room for machine analysis to "reveal" things we are ignoring out of custom. A lot of text analysis done by people is full of biases and deferral to authorities.

E.g. I remember from school getting into an argument with a teacher over the interpretation of a poem. "His" interpretation, which was really the interpretation of some authority who'd written a book, was blatantly contradicted by the text if you assumed that the author hadn't suddenly forgotten all his basic grammar despite all the evidence to the contrary everywhere else that he was always very precise in this respect.

Of course, in some of these kinds of instances, it will be incredibly hard to overcome the retort that any "revelation" is just a bug.

In a more general sense, people are typically exceedingly bad at parsing text, judging by how often online debates devolve into bickering caused largely by misunderstanding the other party's argument. Often to the extent of even ending up arguing against people who you agree with. Having tools that help clarify the parsing for people might be interesting in that respect too.

Well I wouldn't look for idioms, but it would be interesting to throw in information such as "Strong's Concordance" into the mix. I've yet to really think of an application for this library fully, but it would be fun to play around with it nonetheless. I would be analyzing the Hebrew / Greek / Syriac scripts, seeking verses omitted, or missing, etc. It would make for interesting studying if anything.

You might be interested in Andrew Bannister's research on computer analysis of the Quran. He wrote a book on it [1], and there's also this paper which gives a high-level overview [2].

[1] http://www.amazon.com/Oral-Formulaic-Study-Quran-Andrew-Bann...

[2] http://www.academia.edu/9490706/Retelling_the_Tale_A_Compute...

> Any serious academic discussion of biblical texts will involve syntactical breakdown of the text being discussed.

I once interned for a company that's been doing this for years. They have all kinds of features tracing individual words through various different languages, etc.

https://www.logos.com/

Actually it's not very appropriate for studying Bible text. In Biblical Studies you would prefer not to have any errors at all, and since you work with a limited corpus you can afford to annotate by hand. People have in fact done this and I collaborated with a group that has been working on this for decades.

For actual syntactical breakdown of the Bible, I agree. Biblical Scholars, and even competent pastors, can syntactically analyze the Bible sufficiently well.

I would think the technology could be helpful in a fairly narrow way: identifying syntactical constructions outside the Bible to help explain ambiguous syntactical constructions within it (for example, Ugaritic texts, written in another ancient Semitic language similar to Hebrew, are often studied to aid in understanding portions of the Old Testament). Scholars have been doing this without computers for some time and have begun to do this type of analysis with software. I would imagine more sophisticated software would yield at least some new insights.

Most of the really good applications are part of larger systems. Parsing is good in machine translation, for instance. You transform the source text so that it's closer to the target language. Parsing is also useful for question answering, information extraction, text-to-speech...

Here's an example of using information from a syntactic parser to decorate words, and create an enhanced bag-of-words model: https://spacy.io/demos/sense2vec
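To make the "decorate words" idea concrete, here is a minimal sketch (not sense2vec's actual code): each token is joined to its part-of-speech tag, so the verb and the noun "lead" become different features. The tags are written by hand as a stand-in for real tagger output, and `decorated_bag_of_words` is a hypothetical helper name.

```python
from collections import Counter

def decorated_bag_of_words(tagged_tokens):
    """Count 'word|TAG' features from (word, tag) pairs."""
    return Counter(f"{word.lower()}|{tag}" for word, tag in tagged_tokens)

# Hand-written tags standing in for a tagger's output (assumption).
sentence_a = [("They", "PRON"), ("lead", "VERB"), ("the", "DET"), ("team", "NOUN")]
sentence_b = [("Lead", "NOUN"), ("is", "AUX"), ("a", "DET"),
              ("heavy", "ADJ"), ("metal", "NOUN")]

bag_a = decorated_bag_of_words(sentence_a)
bag_b = decorated_bag_of_words(sentence_b)

# A plain bag of words would share the feature "lead" between the two
# sentences; the decorated version keeps the two senses apart.
shared = set(bag_a) & set(bag_b)
print(sorted(shared))
```

The resulting counters can be fed to whatever classifier you already use for bag-of-words features.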

Here's a very terse explanation of using them in a rule-based way: https://spacy.io/docs/tutorials/syntax-search

This is actually really useful for a project I'm working on. I'm trying to detect bias in news sources using sentiment analysis and one of the problems I've run into is identifying who exactly is the subject of a sentence. Using this could be really helpful in parsing out the noun phrases and breaking them down in order to find the subject.
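A rough sketch of how a dependency parse could give you the subject: find the token attached to the root with a subject label and collect everything it dominates. The parse below is hand-written, the `Token` layout and helper names are made up for illustration, and the labels ("nsubj", "prep", "pobj", "dobj") are assumed to follow Stanford-dependencies-style naming; a real parser may use a different label set.

```python
from typing import NamedTuple

class Token(NamedTuple):
    index: int   # 1-based position in the sentence
    word: str
    head: int    # index of the governing token, 0 for the root
    deprel: str  # dependency label (assumed label set)

def subtree(tokens, root_index):
    """Return all tokens dominated by root_index (inclusive), in sentence order."""
    wanted = {root_index}
    changed = True
    while changed:
        changed = False
        for tok in tokens:
            if tok.head in wanted and tok.index not in wanted:
                wanted.add(tok.index)
                changed = True
    return [tok for tok in tokens if tok.index in wanted]

def subject_phrase(tokens):
    """Return the full noun phrase acting as subject of the root verb, or None."""
    root = next(tok for tok in tokens if tok.head == 0)
    for tok in tokens:
        if tok.head == root.index and tok.deprel == "nsubj":
            return " ".join(t.word for t in subtree(tokens, tok.index))
    return None

# Hand-written parse of "The senator from Ohio criticized the bill".
parse = [
    Token(1, "The", 2, "det"),
    Token(2, "senator", 5, "nsubj"),
    Token(3, "from", 2, "prep"),
    Token(4, "Ohio", 3, "pobj"),
    Token(5, "criticized", 0, "ROOT"),
    Token(6, "the", 7, "det"),
    Token(7, "bill", 5, "dobj"),
]

print(subject_phrase(parse))  # The senator from Ohio
```

Passive sentences and coordinated subjects would need extra labels and handling; this only covers the simplest case.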
I've been experimenting with Stanford's CoreNLP to identify named entities for analyzing RSS feeds and I was really impressed by how well it worked, having known nothing about the state of NLP research before I started. Especially things like being able to identify coreferences.
I was actually pretty disappointed with the NER in CoreNLP - I fed a few articles (including this one) into it, and while it's impressive that a computer can do this at all, it's pretty far away from being able to build a usable product. It seems to over-recognize Persons, for example - Parsey McParseFace was tagged as a person, as were Alice and Bob, as was Tesla (in another article), and while all of these are understandable, they weren't the intended meanings in the articles. I was also pretty disappointed with the date parser: while it gets some tricky ones like "Today" and "7 hours ago", it misses very common abbreviations like 7m or 7min or even "7min ago".
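For the abbreviated forms mentioned above ("7m", "7min", "7min ago"), a small hand-rolled pattern is one possible workaround. This is only a sketch and not how CoreNLP does it: `parse_relative` and the unit table are made up for illustration, "m" is assumed to mean minutes rather than months, and the result is just the size of the offset.

```python
import re
from datetime import timedelta

# Abbreviation table is an assumption; extend it for your own data.
_UNITS = {
    "s": "seconds", "sec": "seconds", "secs": "seconds",
    "second": "seconds", "seconds": "seconds",
    "m": "minutes", "min": "minutes", "mins": "minutes",
    "minute": "minutes", "minutes": "minutes",
    "h": "hours", "hr": "hours", "hrs": "hours",
    "hour": "hours", "hours": "hours",
    "d": "days", "day": "days", "days": "days",
}

_PATTERN = re.compile(r"^\s*(\d+)\s*([a-z]+)(?:\s+ago)?\s*$", re.IGNORECASE)

def parse_relative(text):
    """Return the offset as a timedelta, or None if the text is not recognized."""
    match = _PATTERN.match(text)
    if match is None:
        return None
    unit = _UNITS.get(match.group(2).lower())
    if unit is None:
        return None
    return timedelta(**{unit: int(match.group(1))})

for example in ["7m", "7min", "7min ago", "7 hours ago", "Today"]:
    print(example, "->", parse_relative(example))
```

Words like "Today" fall through to `None` here, so this would sit alongside a general-purpose date parser rather than replace it.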
The state of NLP tools generally is much lower than most people think. People think it is much easier than it is.

For the date parser you want http://nlp.stanford.edu/software/sutime.html

The code and rules aren't fun to customize though.

Yeah, I looked at SuTime, but it fell down on many common cases (the CoreNLP online demo is actually integrating SuTime into the annotations it produces).

Another option is Natty [1], but it also seems to fail on the same examples. Natty at least has an ANTLR grammar that's reasonably easy to understand, though.

[1] http://natty.joestelmach.com/

I know of one large group that switched (from Timen[1]) to Heideltime[2] because of multi-language support.

One day someone will build a neural net model to do this rather than hand-written rules.

[1] https://github.com/leondz/timen

[2] https://github.com/HeidelTime/heideltime

Heideltime is the best date tagger I've used. Handles multiple languages better than anything else and fits inside UIMA.

https://github.com/HeidelTime/heideltime

Yes, I've used that before. I'm currently using Textacy for Python which is also really good. However, extracting the named entities from a sentence is still a ways off from determining what's the subject of the sentence, although it gives a good indication. Using NER + quality POS tagging and tree building should do the trick for me I think.

Are you using BOW for sentiment analysis? Also, have you tried tinkering with Watson's sentiment analyzer?

I'm working on a project that analyzes sentiment from speech, and I've been meaning to start on text sentiment analysis, but I'm not sure where to start.

I'm using VADER - https://github.com/cjhutto/vaderSentiment - because it's trained on NYT data which makes it suitable for news sentiment parsing.

The code is pretty readable but relies heavily on a ruleset which might need to be tweaked for one's need.

I think the magic word here is "aspect based" sentiment analysis, cf. http://datascience.stackexchange.com/a/4870

Here is an application of parse trees: sentiment analysis with recursive neural networks based on how components of the parse tree combine to create the overall meaning.

http://nlp.stanford.edu/~socherr/EMNLP2013_RNTN.pdf

They are useful as a preprocessing step for a lot of downstream NLP tasks. It shouldn't be hard to find more papers that take advantage of the tree structure of language.

The typical approach is something like a tree kernel (https://en.wikipedia.org/wiki/Tree_kernel). Looked into them briefly for a work project that never got off the ground, can't say too much about using them in practice.
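For a feel of what a tree kernel computes, here is a small sketch of one common variant, the subset-tree kernel of Collins and Duffy, which counts the tree fragments two constituency trees share. The nested-tuple tree encoding and all function names are assumptions made for this example, and the quadratic loop over node pairs is the naive formulation.

```python
def label(node):
    """A leaf is a plain string (a word); an internal node is (label, child, ...)."""
    return node if isinstance(node, str) else node[0]

def production(node):
    """The grammar rule used at an internal node, e.g. ('S', ('NP', 'VP'))."""
    return (node[0], tuple(label(child) for child in node[1:]))

def internal_nodes(tree):
    """Yield every non-leaf node of the tree."""
    if isinstance(tree, str):
        return
    yield tree
    for child in tree[1:]:
        yield from internal_nodes(child)

def common_fragments(n1, n2, decay=1.0):
    """Weighted count of tree fragments rooted at both n1 and n2."""
    if production(n1) != production(n2):
        return 0.0
    score = decay
    for c1, c2 in zip(n1[1:], n2[1:]):
        if isinstance(c1, str) or isinstance(c2, str):
            continue  # words add nothing beyond the matched production
        score *= 1.0 + common_fragments(c1, c2, decay)
    return score

def tree_kernel(t1, t2, decay=1.0):
    """Sum the shared-fragment counts over all pairs of internal nodes."""
    return sum(common_fragments(a, b, decay)
               for a in internal_nodes(t1)
               for b in internal_nodes(t2))

t1 = ("S", ("NP", ("N", "Alice")), ("VP", ("V", "drove")))
t2 = ("S", ("NP", ("N", "Bob")), ("VP", ("V", "drove")))

print(tree_kernel(t1, t2))  # 10.0
print(tree_kernel(t1, t1))  # 15.0
```

The kernel value can be plugged into any kernel method (an SVM, for instance), usually after normalizing by the self-similarities so that large trees do not dominate.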
> Parse trees are cool to look at, but what can I do with them?

One really simple and obvious thing is word sense disambiguation. Plenty of homonyms are different parts of speech (e.g. the verb "lead" and the noun "lead"). I'm sure there's lots of more sophisticated stuff you can do as well, but this might be the lowest-hanging fruit.

However, for that you just need PoS tags (which are also provided by this Google thing, yes). And of course the hard part of WSD is detecting whether "bank" refers to the bank of a river, or the financial institution, or the building where the institution is located, or [you name it].

I use parse trees as a kind of "advanced language model" for when I need to replace a word in a sentence (see for example: http://www.aclweb.org/anthology/P13-1142 ), it's so much better than using just simple n-grams.

Idea: point this at political speeches / security breach notifications / outage postmortems / etc, and rate them by how many ambiguities with starkly different dependency parses there are... (Well of _course_ we mean the roads inside Alice's car when we made that commitment!)
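One way such a rating might be computed, assuming you can get more than one candidate parse per sentence out of your parser (that part is an assumption, as are the function names): compare the head each word is attached to in two parses and report the fraction that differ. The two parses below are hand-written for the "Alice" sentence.

```python
def attachment_disagreement(heads_a, heads_b):
    """Fraction of words whose head differs between two parses of one sentence."""
    if len(heads_a) != len(heads_b):
        raise ValueError("both parses must cover the same tokens")
    if not heads_a:
        return 0.0
    differing = sum(a != b for a, b in zip(heads_a, heads_b))
    return differing / len(heads_a)

def disputed_words(words, heads_a, heads_b):
    """List the words that the two parses attach differently."""
    return [w for w, a, b in zip(words, heads_a, heads_b) if a != b]

words = ["Alice", "drove", "down", "the", "street", "in", "her", "car"]

# Heads are 1-based token positions, 0 marks the root (hand-written parses).
# Reading A: "in her car" modifies "drove" (Alice was in her car).
heads_verb_attach = [2, 0, 2, 5, 3, 2, 8, 6]
# Reading B: "in her car" modifies "street" (the street is inside the car).
heads_noun_attach = [2, 0, 2, 5, 3, 5, 8, 6]

print(attachment_disagreement(heads_verb_attach, heads_noun_attach))  # 0.125
print(disputed_words(words, heads_verb_attach, heads_noun_attach))    # ['in']
```

A single differing attachment gives a small score but a very different meaning, so a real rating would probably want to weight disagreements by how much of the sentence hangs off the disputed word.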