> What I found was "How to throw away data that doesn't support your desired conclusions," for the most part.

What exactly are you referring to here? This seems like a wildly misguided characterization of statistics, which I am sure cannot be based in expertise or practical applied experience.

> We're building "AI" right now but think about the inputs those see: The very first step is to throw away the statistically too common "stop words"

This is a fundamental misunderstanding of what a "stopword" is and how it's used.

Words like "the" are hard to utilize within a bag-of-words model specifically. Removing them is not something people do/did because they are clueless monkeys. The goal is to improve the signal-to-noise ratio.
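To make the signal-to-noise point concrete, here is a rough sketch of a bag-of-words count with and without stop-word removal. The tiny `STOP_WORDS` set, the `bag_of_words` helper, and the sample sentence are all made up for illustration; real stop-word lists are much longer and vary by library.

```python
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (assumption for this sketch).
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "is", "in", "on"}

def bag_of_words(text, remove_stop_words=False):
    """Count word occurrences, ignoring word order entirely."""
    tokens = text.lower().split()
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return Counter(tokens)

doc = "the cat sat on the mat and the dog sat on the rug"

raw = bag_of_words(doc)
filtered = bag_of_words(doc, remove_stop_words=True)

# In the raw counts the single most frequent token is "the", which says
# nothing about the topic; after filtering only content words remain.
print(raw.most_common(1))
print(sorted(filtered))
```

Once word order is thrown away, a count of "the" carries almost no information about what the document is about, which is the whole reason for dropping it in this kind of model.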

For example, traditionally spam filtering uses a very crude variety of bag-of-words model called "Naive Bayes", in which we assume (wrongly of course) that each word in a message is chosen independently of the others, and that the only difference between spam and not spam is the distribution those words are drawn from. Are you really going to argue that the word "the" is critical to that process? If you can build a better NB spam filter by including stop words, by all means go ahead and do it. But both linguistics and decades of success in the field are against you.
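For anyone who has not seen one, a toy multinomial Naive Bayes filter looks roughly like this. It is only a sketch: the training messages, the stop-word list, and the function names (`tokenize`, `train`, `classify`) are invented for illustration, and it uses add-one (Laplace) smoothing, which is one common choice rather than the only one.

```python
import math
from collections import Counter

# Hypothetical stop-word list (assumption for this sketch).
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "is", "in", "on", "for", "you", "your"}

def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [t for t in text.lower().split() if t not in STOP_WORDS]

def train(labeled_docs):
    """Return per-class word counts and per-class document counts."""
    word_counts = {}
    doc_counts = Counter()
    for text, label in labeled_docs:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokenize(text))
    return word_counts, doc_counts

def classify(text, word_counts, doc_counts):
    """Pick the class with the highest log-posterior under the naive assumption."""
    vocab = set().union(*word_counts.values())
    total_docs = sum(doc_counts.values())
    scores = {}
    for label, counts in word_counts.items():
        total_words = sum(counts.values())
        # Log prior: fraction of training documents with this label.
        score = math.log(doc_counts[label] / total_docs)
        # Words are treated as independent given the class, so their
        # add-one-smoothed log likelihoods simply add up.
        for tok in tokenize(text):
            score += math.log((counts[tok] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

# Made-up training data, far too small for real use.
training = [
    ("win free money now", "spam"),
    ("free prize claim your money", "spam"),
    ("meeting at noon tomorrow", "ham"),
    ("lunch at noon with the team", "ham"),
]
word_counts, doc_counts = train(training)

spam_guess = classify("claim the free money", word_counts, doc_counts)
ham_guess = classify("team meeting at noon", word_counts, doc_counts)
print(spam_guess, ham_guess)
```

Notice that the model never looks at word order. A word like "the" shows up at similar rates in both classes, so its likelihood ratio is close to 1 and it mostly adds noise to the score.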

On the other hand, words with grammatical function like "the" are absolutely important and relevant to the overall structure and meaning of a document. Therefore, training pipelines for modern deep-learning-based LLMs like GPT don't remove stop words (as far as I know at least), because the whole idea of a stopword doesn't make sense in a model like that.

I want to be respectful here, but it sounds like you took a cursory look through three vast literatures, without the perspective of having actually used any of this stuff in real life, and drew some invalid conclusions.