by atdt 3696 days ago
This is actually a rather serious limitation of statistical approaches to language: they work best with utterances that have already been said, or with concepts that are already strongly associated in common speech. Such utterances may make up the bulk of what we say and write, but the remainder isn't gobbledygook. It contains most of the intimacy, poetry, and humor of interpersonal communication, all of which trade on surprise and novelty.
2 comments

That's basically Chomsky's argument with the "colorless green ideas sleep furiously" sentence: if you put words together to form a sentence never seen before, a statistical model supposedly cannot help you. The thing is, a later paper showed that a simple Markov model can in fact discriminate this grammatical sentence from an ungrammatical reordering of the same words. Novel and surprising sentences are never completely alien. They use familiar structures of the language, and combinations of words and other building blocks that we have seen before, and this is exploited when we analyze such sentences. Surprise and novelty are actually strongly related to statistics (cf. information theory, where the surprise of an event is measured as the negative log of its probability).
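
To make the Markov-model point concrete, here is a minimal sketch. It is not the model from the paper: it is a toy bigram model over word classes with add-one smoothing, and the class table `WORD_CLASS`, the training text `CORPUS`, and the function names are all made up for illustration. The classes are assigned by hand here, whereas a real model would have to learn them from data. The toy corpus contains none of the word pairs in the test sentence, yet the grammatical order still comes out as less surprising (fewer bits) than the reversed order, because the class transitions are familiar. Both orderings use the same words, so any per-word probabilities would cancel in the comparison and are left out.

```python
import math
from collections import Counter

# Hypothetical hand-labelled word classes (a real model would learn these).
WORD_CLASS = {
    "colorless": "ADJ", "green": "ADJ", "big": "ADJ", "old": "ADJ",
    "ideas": "NOUN", "dogs": "NOUN", "trees": "NOUN",
    "sleep": "VERB", "bark": "VERB", "grow": "VERB",
    "furiously": "ADV", "loudly": "ADV", "slowly": "ADV",
}

# Toy training text: no word pair of the test sentence occurs in it.
CORPUS = [
    "big old dogs bark loudly",
    "old trees grow slowly",
    "big dogs sleep",
    "old big trees grow",
]

# Symbols that can follow another symbol (used for add-one smoothing).
NEXT_SYMBOLS = ["ADJ", "NOUN", "VERB", "ADV", "</s>"]


def classes(sentence):
    """Map a sentence to its class sequence with start/end markers."""
    return ["<s>"] + [WORD_CLASS[w] for w in sentence.split()] + ["</s>"]


def train(corpus):
    """Count class bigrams and how often each class appears as a context."""
    bigrams, contexts = Counter(), Counter()
    for sentence in corpus:
        tags = classes(sentence)
        for a, b in zip(tags, tags[1:]):
            bigrams[(a, b)] += 1
            contexts[a] += 1
    return bigrams, contexts


def surprisal_bits(sentence, bigrams, contexts):
    """Total surprisal, -log2 P(class sequence), with add-one smoothing."""
    tags = classes(sentence)
    total = 0.0
    for a, b in zip(tags, tags[1:]):
        p = (bigrams[(a, b)] + 1) / (contexts[a] + len(NEXT_SYMBOLS))
        total += -math.log2(p)
    return total


if __name__ == "__main__":
    bigrams, contexts = train(CORPUS)
    for s in ("colorless green ideas sleep furiously",
              "furiously sleep ideas green colorless"):
        print(f"{surprisal_bits(s, bigrams, contexts):6.2f} bits  {s}")
```

The smoothing is what keeps unseen transitions from getting probability zero: they are merely expensive, not impossible, which is the sense in which a novel sentence is surprising without being gobbledygook.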
Isn't that how we learn? We interpret odd words as gobbledygook until we look them up or find out how they're used.