by kemiller2002 928 days ago
In my experimental design class in college, I remember the professor talking about the difficulties of dealing with data and deciding what to include and what not to. He pointed to a case where a data point looks like an anomaly and possibly should be removed. He showed the math behind it: including the point means the experiment doesn't show a positive result, and excluding it means it does. So which do you do? If including it means you don't get funding for you and your team, what do you decide? This, of course, led into the ethics portion of the course, and how easy it is to go down a bad path, because you can manipulate data to make it say what you want.
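To make that concrete, here is a minimal sketch with made-up numbers. The measurements, the choice of a one-sample t-test against zero, and the helper names are all my own assumptions for illustration, not the professor's actual example.

```python
import math
import statistics

# Two-sided 5% critical values of Student's t, keyed by degrees of freedom.
# Only the two entries needed for this example are listed.
T_CRIT_05 = {8: 2.306, 9: 2.262}

def one_sample_t(xs):
    """t statistic for the null hypothesis that the true mean is zero."""
    n = len(xs)
    return statistics.mean(xs) / (statistics.stdev(xs) / math.sqrt(n))

def is_significant(xs):
    """True if |t| exceeds the 5% critical value for len(xs) - 1 degrees of freedom."""
    return abs(one_sample_t(xs)) > T_CRIT_05[len(xs) - 1]

# Hypothetical effect measurements; the last one is the suspicious data point.
measurements = [1.2, 0.8, 1.5, 0.9, 1.1, 1.3, 0.7, 1.0, 1.4, -6.0]

print("with the point:   ", is_significant(measurements))
print("without the point:", is_significant(measurements[:-1]))
```

With these invented values, one point flips the conclusion, which is exactly why the decision to drop it needs a justification that doesn't depend on the result you were hoping for.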
3 comments

You're describing a legitimately hard problem faced by honest researchers. In this case, it seems like we have enough evidence to suggest that we are dealing not with honest researchers but with deliberate fraud.
That's not a hard problem for an honest researcher. Just explain the risk of that data point in the grant application, and if the funders decide not to take the risk, that's their prerogative.
Or, if feasible, you get more data.
Long while back, I had a big pile of numbers and I knew they could offer some meaning if only I could extract it. There's this whole discipline that advertises techniques for doing that, called "Statistics", so I looked there for lessons.

What I found was "How to throw away data that doesn't support your desired conclusions," for the most part. "Actuarial Science," a different field, had some useful techniques but not many. They're most interested in ensuring the bad data doesn't get into the tables in the first place; but at least they are doing "data on data" comparisons and not "data to expectations."

We're building "AI" right now but think about the inputs those see: The very first step is to throw away the statistically too common "stop words" ...

> What I found was "How to throw away data that doesn't support your desired conclusions," for the most part.

What exactly are you referring to here? This seems like a wildly misguided characterization of statistics, which I am sure cannot be based in expertise or practical applied experience.

> We're building "AI" right now but think about the inputs those see: The very first step is to throw away the statistically too common "stop words"

This is a fundamental misunderstanding of what a "stopword" is and how it's used.

Words like "the" are hard to utilize within a bag-of-words model specifically. Removing them is not something people do/did because they are clueless monkeys. The goal is to improve the signal-to-noise ratio.

For example, traditionally spam filtering uses a very crude variety of bag-of-words model called "Naive Bayes", in which we assume (wrongly of course) that word choice is completely random, and that the only difference between spam and not spam is that random distribution of words. Are you really going to argue that the word "the" is critical to that process? If you can build a better NB spam filter by including stop words, by all means go ahead and do it. But both linguistics and decades of success in the field are against you.
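Here is a rough sketch of what that looks like, with stop-word removal as a switch you can flip. The training messages, the stop-word list, and the function names are all invented for illustration, and a corpus this tiny says nothing about which setting gives a better filter; it only shows the mechanism.

```python
import math
from collections import Counter

# Illustrative stop-word list; real lists are longer and vary by toolkit.
STOP_WORDS = {"the", "a", "to", "of", "and", "is", "for", "at"}

def tokenize(text, drop_stop_words=True):
    """Lowercase, split on whitespace, optionally drop stop words."""
    words = text.lower().split()
    return [w for w in words if not (drop_stop_words and w in STOP_WORDS)]

def train(docs, drop_stop_words=True):
    """docs is a list of (text, label). Returns (word counts per label, doc counts)."""
    word_counts = {}
    doc_counts = Counter()
    for text, label in docs:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokenize(text, drop_stop_words))
    return word_counts, doc_counts

def classify(text, model, drop_stop_words=True):
    """Multinomial Naive Bayes with add-one smoothing, computed in log space."""
    word_counts, doc_counts = model
    vocab = set().union(*word_counts.values())
    total_docs = sum(doc_counts.values())
    scores = {}
    for label, counts in word_counts.items():
        total_words = sum(counts.values())
        score = math.log(doc_counts[label] / total_docs)  # class prior
        for w in tokenize(text, drop_stop_words):
            # Each word is treated as an independent draw: the "naive" part.
            score += math.log((counts[w] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical training messages.
DOCS = [
    ("win free money now", "spam"),
    ("claim the free prize now", "spam"),
    ("free money for the winner", "spam"),
    ("the meeting is at noon", "ham"),
    ("lunch at the office tomorrow", "ham"),
    ("notes for the meeting", "ham"),
]

model = train(DOCS)
print(classify("free prize money", model))
print(classify("notes for the office meeting", model))
```

Pass `drop_stop_words=False` to both `train` and `classify` to keep "the" and friends in the counts and compare on your own data.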

On the other hand, words with grammatical function like "the" are absolutely important and relevant to the overall structure and meaning of a document. Therefore, training pipelines for modern deep-learning-based LLMs like GPT don't remove stop words (as far as I know at least), because the whole idea of a stopword doesn't make sense in a model like that.

I want to be respectful here, but it sounds like you took a cursory look through three vast literatures, without the perspective of having actually used any of this stuff in real life, and drew some invalid conclusions.

> I want to be respectful here,

Thanks!

Many people in these fields agree my conclusions are invalid. I say the same about theirs.

You're entitled to your own opinion of course, but your conclusions appear to be based on beginner-level misunderstandings. That doesn't seem like a constructive or productive way to conduct oneself through life.
you should see my rants about why normalizing weights is a bad idea and how a limited context window is effectively random interpolation
There are so many complications because of data fraud and the fear of perception of data fraud.

I'm the guy who builds the experiments on a team of user researchers. There are all sorts of things that seem intuitive to an outsider but are pooh-poohed by practitioners as unethical. For instance, you might run a study that doesn't have enough participants to reach a statistically significant conclusion. An outsider would deploy it to more participants to see if the trend becomes significant with more data. A trained researcher will cringe at that proposal.

So far as I can tell, researchers consider the experiment final as soon as you peek at the data. If you want any changes - more data, different demographics, etc. - you have to throw out everything and start over. Even though it's logically interchangeable, the data you've already collected is considered spoiled, because they don't want allegations of tampering/data grooming.
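Beyond perception, there is a statistical side to the cringe that a small simulation can sketch. Assume there is no real effect at all, and compare testing once at the planned sample size with checking after every batch and declaring victory the first time the result looks significant. The batch size, sample size, z-test with known variance, and function names are my assumptions for illustration.

```python
import math
import random

Z_CRIT = 1.96  # two-sided 5% threshold for a z-test

def run_trial(rng, n_max=100, batch=10):
    """Simulate one study with NO true effect (mean 0, known sd 1).

    Returns (significant at the final sample size,
             significant at any look taken every `batch` observations).
    """
    total = 0.0
    any_peek = False
    for n in range(1, n_max + 1):
        total += rng.gauss(0.0, 1.0)
        # z = sample mean * sqrt(n), which simplifies to total / sqrt(n).
        if n % batch == 0 and abs(total / math.sqrt(n)) > Z_CRIT:
            any_peek = True
    at_end = abs(total / math.sqrt(n_max)) > Z_CRIT
    return at_end, any_peek

def false_positive_rates(trials=2000, seed=0):
    """Fraction of null studies called significant under each stopping rule."""
    rng = random.Random(seed)
    fixed = peeking = 0
    for _ in range(trials):
        at_end, at_any = run_trial(rng)
        fixed += at_end
        peeking += at_any
    return fixed / trials, peeking / trials

fixed_rate, peeking_rate = false_positive_rates()
print(f"test once at the end:    {fixed_rate:.3f}")
print(f"stop at first 'success': {peeking_rate:.3f}")
```

The single-look rate should land near the nominal 5%, while the stop-when-it-looks-good rate comes out noticeably higher. That inflation only arises when the decision to collect more depends on what the data showed so far.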