|
|
|
|
|
by YeGoblynQueenne
3696 days ago
|
|
The problem with statistical information is data sparsity. You could read all English texts ever written (or spoken for that matter) and the number of meaningful combinations left to see would still be infinite. If you try to learn language only from finite examples, you'll never see enough of it to learn it well. That's why Google reports results against the Penn trrebank. It's not even clear what's a good metric outside of finite corpora (that the field has been overfitting to for decades like someone noted above). |
|
If you think about it, there is an iteration happening within machine learning that is essentially building that prior knowledge about the world by reusing previous models as inputs to knew ones. For example how Spacy uses word2vec vectors to do parsing and NER and then sense2vec uses Spacy pos tags create word vectors.
sense2vec.spacy.io