Hacker News new | ask | show | jobs
by YeGoblynQueenne 3696 days ago
The problem with statistical information is data sparsity. You could read all English texts ever written (or spoken for that matter) and the number of meaningful combinations left to see would still be infinite. If you try to learn language only from finite examples, you'll never see enough of it to learn it well. That's why Google reports results against the Penn trrebank. It's not even clear what's a good metric outside of finite corpora (that the field has been overfitting to for decades like someone noted above).
1 comments

Prior knowledge solves that problem. A human encounters the same sparsity a computer does when learning from text but prior knowledge allows us to connect rare features to a larger model in which they are, in a way, less rare.

If you think about it, there is an iteration happening within machine learning that is essentially building that prior knowledge about the world by reusing previous models as inputs to knew ones. For example how Spacy uses word2vec vectors to do parsing and NER and then sense2vec uses Spacy pos tags create word vectors.

sense2vec.spacy.io

>> Prior knowledge solves that problem.

Prior knowledge _might_ solve that problem. It's not really solved yet so who knows. Yeah, work is ongoing and word vectors sound cool and all, but in the past people said the same thing about bag-of-words models and look where we are now.

Humans solve sparsity, sure, we learn language from ridiculously few data points, but who knows what it is that we do, exactly? If we knew, we wouldn't be discussing this.

Let's restate the problem to make sure we're talking about the same thing: the problem is that the number of possible utterances in a given language that are grammatically correct according to some grammar of that language is infinite (or so big as for it to take longer than our current universe has to live before an utterance is repeated).

And it's a problem because it's impossible to count infinity given only finite time. I don't see how prior knowledge, or anything else, can solve this.

Which must mean humans do something else entirely, and all our efforts that are based on the assumption that you can do some clever search and avoid having to face infinity, are misguided and doomed to fail.