by h2odragon
929 days ago
Long while back, I had a big pile of numbers and I knew they could offer some meaning if only I could extract it. There's this whole discipline that advertises techniques for doing that, called "Statistics", so I looked there for lessons. What I found was "How to throw away data that doesn't support your desired conclusions," for the most part. "Actuarial Science," a different field, had some useful techniques but not many. They're most interested in ensuring the bad data doesn't get into the tables in the first place; but at least they are doing "data on data" comparisons and not "data to expectations". We're building "AI" right now but think about the inputs those see: The very first step is to throw away the statistically too common "stop words" ...
What exactly are you referring to here? This seems like a wildly misguided characterization of statistics, which I am sure cannot be based in expertise or practical applied experience.
> We're building "AI" right now but think about the inputs those see: The very first step is to throw away the statistically too common "stop words"
This is a fundamental misunderstanding of what a "stopword" is and how it's used.
Words like "the" are hard to utilize within a bag-of-words model specifically. Removing them is not something people do (or did) because they are clueless monkeys. The goal is to improve the signal-to-noise ratio.
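To make the signal-to-noise point concrete, here is a minimal sketch of a bag-of-words count with and without stop word removal. The stop word list and the sample sentence are made up for illustration; real lists are longer and vary by library.

```python
from collections import Counter

# Hypothetical stop word list and toy document, for illustration only.
STOP_WORDS = {"the", "a", "of", "to", "and", "is", "in"}

def bag_of_words(text, drop_stop_words=False):
    """Count word occurrences, ignoring order (the bag-of-words assumption)."""
    words = text.lower().split()
    if drop_stop_words:
        words = [w for w in words if w not in STOP_WORDS]
    return Counter(words)

doc = "the winner of the lottery is the owner of the ticket"
raw = bag_of_words(doc)
filtered = bag_of_words(doc, drop_stop_words=True)

print(raw.most_common(2))  # the two most frequent words carry no topic information
print(filtered)            # only the content words remain
```

Once word order is thrown away, the most frequent entries in the raw counts are function words that show up in nearly every English document, so they say little about what this particular document is about.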
For example, traditional spam filtering uses a very crude bag-of-words classifier called "Naive Bayes", in which we assume (wrongly, of course) that each word is chosen independently of the others, and that the only difference between spam and not-spam is the distribution those words are drawn from. Are you really going to argue that the word "the" is critical to that process? If you can build a better NB spam filter by including stop words, by all means go ahead and do it. But both linguistics and decades of success in the field are against you.
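For anyone who has not seen one, a toy Naive Bayes spam filter fits in a few dozen lines. This is only a sketch under stated assumptions: the training messages, the stop word list, and the function names are all invented here, and it uses add-one smoothing with a simplified treatment of unseen words.

```python
import math
from collections import Counter

# Hypothetical stop word list and toy training data, for illustration only.
STOP_WORDS = {"the", "a", "to", "is", "at", "for", "your"}

def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def train(examples):
    """examples: list of (text, label). Returns per-label word counts and doc counts."""
    word_counts = {}
    doc_counts = Counter()
    for text, label in examples:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokenize(text))
    return word_counts, doc_counts

def classify(text, word_counts, doc_counts):
    """Pick the label with the highest log prior plus summed log word likelihoods."""
    vocab = set().union(*word_counts.values())
    total_docs = sum(doc_counts.values())
    scores = {}
    for label, counts in word_counts.items():
        score = math.log(doc_counts[label] / total_docs)
        # Add-one (Laplace) smoothing so unseen words do not zero out the score.
        denom = sum(counts.values()) + len(vocab)
        for w in tokenize(text):
            score += math.log((counts[w] + 1) / denom)
        scores[label] = score
    return max(scores, key=scores.get)

examples = [
    ("win free money now", "spam"),
    ("free prize claim your money", "spam"),
    ("meeting at noon for the project", "ham"),
    ("lunch is at noon", "ham"),
]
word_counts, doc_counts = train(examples)
print(classify("claim your free money", word_counts, doc_counts))
print(classify("project meeting at noon", word_counts, doc_counts))
```

The independence assumption is the sum inside `classify`: each word contributes its own log likelihood regardless of its neighbors. To test the claim about stop words, empty out `STOP_WORDS` and compare accuracy on a real labeled corpus.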
On the other hand, words with grammatical function like "the" are absolutely important and relevant to the overall structure and meaning of a document. Therefore, training pipelines for modern deep-learning-based LLMs like GPT don't remove stop words (as far as I know at least), because the whole idea of a stopword doesn't make sense in a model like that.
I want to be respectful here, but it sounds like you took a cursory look through three vast literatures, without the perspective of having actually used any of this stuff in real life, and drew some invalid conclusions.