Hacker News
by joosters 3692 days ago
I don't see how a linguistic parser can cope with all the ambiguities in human speech or writing. It's more than a problem of semantics, you also have to know things about the world in which we live in order to make sense of which syntactic structure is correct.

For example, take a sentence like "The cat sat on the rug. It meowed." Did the cat meow, or did the rug meow? You can't determine that by semantics; you have to know that cats meow and rugs don't. So to parse language well, you need to know an awful lot about the real world. Simply training your parser on lots of text and throwing neural nets at the code isn't going to fix this problem.

4 comments

This is exactly the type of problem that a good parser should be able to solve, and training a parser on lots of data and throwing neural nets at it may indeed be a viable solution. Why wouldn't it be? The article describes how their architecture can help make sense of ambiguity.

In terms of a basic probabilistic model, P(meow | rug) would be far lower than P(meow | cat), and that alone would be enough to push the parser toward the correct decision. Now, if the sentence were "The cat sat on the rug. It was furry", that would be more ambiguous, just as it would be for a human reader. But models trained on real-world data do learn about the world.
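The P(meow | cat) versus P(meow | rug) comparison can be sketched in a few lines of Python. This is only a toy illustration: the (subject, verb) pairs are invented stand-ins for counts mined from a large corpus, and the function names and the add-alpha smoothing are assumptions, not anything taken from the article's parser.

```python
from collections import Counter

# Toy (subject, verb) pairs standing in for counts mined from a large corpus.
# These observations are invented for illustration only.
pairs = [
    ("cat", "meowed"), ("cat", "meowed"), ("cat", "slept"), ("cat", "purred"),
    ("rug", "faded"), ("rug", "frayed"), ("rug", "slipped"),
]

pair_counts = Counter(pairs)
subject_counts = Counter(subject for subject, _ in pairs)
vocab = {verb for _, verb in pairs}


def p_verb_given_subject(verb, subject, alpha=1.0):
    """Add-alpha smoothed estimate of P(verb | subject)."""
    return (pair_counts[(subject, verb)] + alpha) / (
        subject_counts[subject] + alpha * len(vocab)
    )


def resolve_pronoun(verb, candidates):
    """Pick the candidate antecedent most likely to be the verb's subject."""
    return max(candidates, key=lambda c: p_verb_given_subject(verb, c))


print(resolve_pronoun("meowed", ["cat", "rug"]))
```

Smoothing keeps P(meowed | rug) above zero, so an unusual reading is penalized rather than ruled out.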

That's exactly why it's using a neural net, and yes, a lot of text will fix this problem. The only reason we know cats meow and rugs don't is that we have learned about cats and rugs. Throw enough training data at it and the parser will figure out what is meowing.

An interesting example of this, which you can easily try for yourself, is playing with Google's voice-to-text features: if you say silly things like "the rug meowed", you will get terrible results because, no matter how clearly it can hear you, its training data tells it that the sentence makes no sense.

This is actually a rather serious limitation of statistical approaches to language: they work best with utterances that have already been said, or with concepts that are already strongly associated in common speech. Such utterances may make up the bulk of what we say and write, but the remainder isn't gobbledygook. It contains most of the intimacy, poetry, and humor of interpersonal communication, all of which trade on surprise and novelty.

That's basically Chomsky's argument with the "colorless green ideas" sentence. If you put words together to form a sentence never seen before, supposedly a statistical model cannot help you. The thing is, a paper later showed that a simple Markov model is actually perfectly able to discriminate this grammatical sentence from an ungrammatical one. Novel and surprising sentences are never completely alien. They use familiar structures of the language, and combinations of words and other building blocks that we have seen before, and this is exploited when we analyze such sentences. Surprise and novelty are actually strongly related to statistics (cf. information theory).

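To make the "familiar building blocks" point concrete, here is a minimal sketch of a Markov (bigram) model over word classes. It is not the model from the paper mentioned above: the word classes are hand-assigned, the four-sentence corpus is invented, and the model scores class sequences only, ignoring how likely each word is within its class. Neither test sentence occurs in the training data.

```python
import math
from collections import Counter

# Hypothetical hand-assigned word classes; a real system would learn these.
WORD_CLASS = {
    "colorless": "ADJ", "green": "ADJ", "big": "ADJ", "old": "ADJ",
    "ideas": "N", "dogs": "N", "cats": "N",
    "sleep": "V", "bark": "V", "purr": "V",
    "furiously": "ADV", "loudly": "ADV", "quietly": "ADV",
}

# Invented toy training corpus; it shares no words with the test sentences.
CORPUS = [
    "big dogs bark loudly",
    "old cats purr quietly",
    "big old dogs bark loudly",
    "cats purr",
]


def class_sequence(sentence):
    """Map a sentence to its class sequence, padded with boundary symbols."""
    return ["<s>"] + [WORD_CLASS[w] for w in sentence.split()] + ["</s>"]


bigrams = Counter()
contexts = Counter()
for sentence in CORPUS:
    tags = class_sequence(sentence)
    for prev, nxt in zip(tags, tags[1:]):
        bigrams[(prev, nxt)] += 1
        contexts[prev] += 1

# Symbols that can follow a context: every word class plus end-of-sentence.
NEXT_SYMBOLS = set(WORD_CLASS.values()) | {"</s>"}


def log2_prob(sentence):
    """Add-one smoothed log2 probability of the sentence's class sequence."""
    tags = class_sequence(sentence)
    total = 0.0
    for prev, nxt in zip(tags, tags[1:]):
        p = (bigrams[(prev, nxt)] + 1) / (contexts[prev] + len(NEXT_SYMBOLS))
        total += math.log2(p)
    return total


grammatical = "colorless green ideas sleep furiously"
scrambled = "furiously sleep ideas green colorless"
print(log2_prob(grammatical), log2_prob(scrambled))
```

The negated log probability is the sentence's surprisal in bits, which is the link to information theory: the grammatical sentence is surprising, but far less so than its scrambled version.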
Isn't that how we learn? We interpret odd words as gobbledygook until we look them up or find out how they're used.

That is not necessarily true. The problem of ambiguity is fundamental to natural language processing, and a lot of research goes into addressing it. If the training data also contains a sentence in which the word "cat" is unambiguously the subject of the verb "meow", then this could give our parser clues about ambiguous attachment or, in this case, anaphora resolution (what "it" refers to). In any intro NLP class, you will learn about lexicalized parsing, which takes the head word of the phrase into account when making parsing decisions. I haven't read the paper on this parser yet, but I don't think it is hard to see that your sentence could be accurately parsed given enough data. Look up "word embeddings", for instance, which are fundamental to deep learning for NLP and could probably be trained to assist in disambiguating anaphora or attachment.

> You can't determine that by semantics

Actually, "animacy" is a fundamental feature in semantics. It's part of your mental lexicon that a cat is animate and a rug isn't, and you would simply infer from that which is the referent for "it". As semantic challenges go, this is a very trivial one. In general the border between linguistic and world knowledge can become blurred. There may be limits to what can be learned purely from text, but seeing that this model achieved 94%, a lot can be learned purely from (annotated) text.
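The animacy argument can be sketched as a lexicon lookup. This is a hypothetical illustration: the mini-lexicon, the set of verbs requiring an animate subject, and the function name are all assumptions made for the example, not a description of how the article's model works.

```python
# Hypothetical mini-lexicon: each noun carries an "animate" feature.
LEXICON = {
    "cat": {"animate": True},
    "dog": {"animate": True},
    "rug": {"animate": False},
    "sofa": {"animate": False},
}

# Hypothetical selectional restriction: verbs that need an animate subject.
REQUIRES_ANIMATE_SUBJECT = {"meowed", "barked", "slept"}


def compatible_antecedents(verb, candidates):
    """Return the candidates that could serve as the subject of the verb."""
    if verb not in REQUIRES_ANIMATE_SUBJECT:
        return list(candidates)  # no restriction, so the pronoun stays ambiguous
    return [noun for noun in candidates if LEXICON[noun]["animate"]]


print(compatible_antecedents("meowed", ["cat", "rug"]))  # ['cat']
print(compatible_antecedents("faded", ["cat", "rug"]))   # ['cat', 'rug']
```

A hard feature like this settles "It meowed" at once but leaves other sentences ambiguous, which is where the blurring between linguistic and world knowledge comes in.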