by gabelschlager
1203 days ago
Well, Chomsky already dismissed corpus-based linguistics in the 90s and 2000s, because a corpus (a large collection of text documents, e.g., newspapers, blog posts, literature, or everything mixed together) is never a good enough approximation of the true underlying distribution of all words/constructs in a language.
For example, a newspaper-based corpus might contain frequent occurrences of city names or politicians' names, even though these come up far less often in real everyday speech, because many people don't actually talk about those politicians all day long. Conversely, the names of small cities might have a frequency of 0. Naturally, he will, and does, also dismiss anything that has come out of the ML field in the past decade. But I agree with the article. Dealing with language only in a theoretical/mathematical way, without even trying to evaluate your theories against real data, is just not very efficient, and it ignores that language models do seem to work to some degree.