by jonknee 3692 days ago
That's exactly why it's using a neural net, and yes, a lot of text will fix this problem. The only reason we know cats meow and rugs don't is that we have learned about cats and rugs. Throw enough training data at it and the parser will figure out what is meowing.

An interesting example you can easily try for yourself is Google's voice-to-text feature: if you say something silly like "the rug meowed", you will get terrible results, because no matter how clearly it can hear you, its training data tells it that the sentence makes no sense.

1 comments

This is actually a rather serious limitation of statistical approaches to language: they work best with utterances that have already been said, or with concepts that are already strongly associated in common speech. Such utterances may make up the bulk of what we say and write, but the remainder isn't gobbledygook. It contains most of the intimacy, poetry, and humor of interpersonal communication, all of which trade on surprise and novelty.

That's basically Chomsky's argument with the "colorless green ideas" sentence: if you put words together to form a sentence never seen before, a statistical model supposedly cannot help you. The thing is, a later paper showed that a simple Markov model is perfectly able to tell this grammatical sentence apart from an ungrammatical one. Novel and surprising sentences are never completely alien. They use familiar structures of the language, and combinations of words and other building blocks that we have seen before, and this is what gets exploited when we analyze such sentences. Surprise and novelty are actually strongly related to statistics (cf. information theory).
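
To make the Markov-model point concrete, here is a minimal sketch, not the method from the paper mentioned above. It assumes a hand-written word-to-class lexicon and a tiny made-up corpus (both hypothetical), trains a bigram model over word classes with add-one smoothing, and compares the log-probability of the famous sentence against its reversal. None of the word pairs in the test sentence occur in the training text; the model only relies on class sequences it has seen before. The negative log-probability is the surprisal in the information-theoretic sense.

```python
import math
from collections import Counter

# Hypothetical toy lexicon: word -> coarse word class (an assumption for this sketch).
LEXICON = {
    "colorless": "ADJ", "green": "ADJ", "big": "ADJ", "old": "ADJ",
    "ideas": "NOUN", "dogs": "NOUN", "cats": "NOUN",
    "sleep": "VERB", "bark": "VERB", "purr": "VERB",
    "furiously": "ADV", "loudly": "ADV", "quietly": "ADV",
}

# Tiny made-up training corpus. It contains none of the word pairs
# that appear in the test sentences below.
CORPUS = [
    "big old dogs bark loudly",
    "old cats purr quietly",
    "big dogs sleep quietly",
    "cats sleep",
    "old big cats bark loudly",
]

# Symbols that can follow a context: every word class plus the end marker.
NEXT_SYMBOLS = sorted(set(LEXICON.values()) | {"</s>"})


def classes(sentence):
    """Map a sentence to its class sequence, padded with start/end markers."""
    return ["<s>"] + [LEXICON[w] for w in sentence.split()] + ["</s>"]


def train(corpus):
    """Count class bigrams and how often each class appears as a context."""
    bigrams, contexts = Counter(), Counter()
    for sentence in corpus:
        seq = classes(sentence)
        for prev, nxt in zip(seq, seq[1:]):
            bigrams[(prev, nxt)] += 1
            contexts[prev] += 1
    return bigrams, contexts


def log_prob(sentence, bigrams, contexts):
    """Log-probability of the class sequence under an add-one smoothed bigram model."""
    seq = classes(sentence)
    total = 0.0
    for prev, nxt in zip(seq, seq[1:]):
        p = (bigrams[(prev, nxt)] + 1) / (contexts[prev] + len(NEXT_SYMBOLS))
        total += math.log(p)
    return total


if __name__ == "__main__":
    bigrams, contexts = train(CORPUS)
    for s in ("colorless green ideas sleep furiously",
              "furiously sleep ideas green colorless"):
        lp = log_prob(s, bigrams, contexts)
        print(f"{s!r}: log-prob {lp:.2f}, surprisal {-lp:.2f} nats")
```

Both test sentences use the same words, so any per-word probabilities would cancel out in the comparison; only the class transitions differ, which is why the sketch scores class sequences alone.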

Isn't that how we learn? We interpret odd words as gobbledygook until we look them up or find out how they're used.