by NY_USA_Hacker 5545 days ago
This article is a great example of why 'computer science' people should not assume that they understand statistics, or even much about data manipulation.

Then the article is an example of why computer people should be careful about where they learn their statistics!

The article is awash in hand wringing about "interval scale" and "ordinal scale" data without being at all clear on just why someone should care, and for all the rest of the article they should not care.

So, the article has:

"For ordinal data, one should use non-parametric statistical tests which do not assume a normal distribution of the data."

Mostly nonsense. In statistical testing, the normal distribution arises mostly just via the central limit theorem which has quite meager assumptions trivially satisfied by "Likert" scale data.
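As a rough illustration of the central limit theorem point, here is a minimal simulation sketch. The rating distribution `WEIGHTS`, the sample size, and the number of trials are made-up assumptions, not data from the article. It checks how often the sample mean of skewed 1-5 ratings falls within one and two standard errors of the expectation.

```python
import math
import random

RATINGS = [1, 2, 3, 4, 5]
# Hypothetical, deliberately skewed rating distribution (an assumption, not data from the article).
WEIGHTS = [0.05, 0.05, 0.10, 0.30, 0.50]

def sample_means(n, trials, seed=0):
    """Return `trials` sample means, each computed from `n` independent simulated ratings."""
    rng = random.Random(seed)
    return [sum(rng.choices(RATINGS, weights=WEIGHTS, k=n)) / n for _ in range(trials)]

# Exact expectation and standard deviation of a single rating.
mu = sum(r * w for r, w in zip(RATINGS, WEIGHTS))
sigma = math.sqrt(sum(w * (r - mu) ** 2 for r, w in zip(RATINGS, WEIGHTS)))

n, trials = 200, 5000
standard_error = sigma / math.sqrt(n)
means = sample_means(n, trials)

# Fraction of sample means within 1 and 2 standard errors of the expectation.
within_1 = sum(abs(m - mu) <= standard_error for m in means) / trials
within_2 = sum(abs(m - mu) <= 2 * standard_error for m in means) / trials
print(f"within 1 SE: {within_1:.3f}, within 2 SE: {within_2:.3f}")
```

For an exactly normal distribution those two fractions would be about 0.683 and 0.954, so values near those suggest the normal approximation is already reasonable at this sample size, even though the individual ratings are discrete and skewed.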

Then there is:

"Furthermore, because of this it makes no sense to report means of likert scale data--you should report the mode."

Nonsense: The law of large numbers has especially meager assumptions also trivially satisfied by Likert scale data. If you want to estimate expectation, then definitely use the mean and not the mode.

Beyond the law of large numbers, there is also the classic

Paul R. Halmos, "The Theory of Unbiased Estimation", 'Annals of Mathematical Statistics', Volume 17, Number 1, pages 34-43, 1946.

that makes clear that, among unbiased estimators, the sample mean is the minimum-variance way to estimate expectation.

If you want to use the mode for something, then say what the heck you want to use it for and then justify using the mode as the estimator.
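To make the mean-versus-mode point concrete, here is a small sketch under stated assumptions: the "love it or hate it" distribution `WEIGHTS` and the helper names `sample_mean` and `sample_mode` are invented for illustration. As the sample grows, the sample mean settles near the expectation (the law of large numbers), while the sample mode settles on the most frequent rating, which is a different quantity.

```python
import random
from collections import Counter

RATINGS = [1, 2, 3, 4, 5]
# Hypothetical "love it or hate it" distribution (an assumption for illustration).
WEIGHTS = [0.30, 0.05, 0.05, 0.25, 0.35]

def sample_mean(xs):
    """Arithmetic mean of the observed ratings."""
    return sum(xs) / len(xs)

def sample_mode(xs):
    """Most frequent observed rating (ties go to the value seen first)."""
    return Counter(xs).most_common(1)[0][0]

# Exact expectation of a single rating under WEIGHTS.
true_expectation = sum(r * w for r, w in zip(RATINGS, WEIGHTS))

rng = random.Random(1)
estimates = {}
for n in (10, 100, 1000, 10_000, 100_000):
    xs = rng.choices(RATINGS, weights=WEIGHTS, k=n)
    estimates[n] = (sample_mean(xs), sample_mode(xs))
    print(n, estimates[n])

print("expectation:", true_expectation)
```

Under these weights the expectation is 3.30 while the most likely rating is 5, so the mode answers a different question from the mean.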

There is:

"In order to defend that ratings can be treated as interval, we should have some validation that the distance between different ratings is approximately equal."

Nonsense. Instead, you get a 'rating', say, an integer in [1, 5]. Now you have it. Use it. For

"validation that the distance between different ratings is approximately equal."

why bother? Besides, "the distance" is undefined here!

For the

"This is a clear indication that users perceive that the distance between a 2 and a 3 is much lower than between a 4 and a 5."

the writer is just fishing in muddy waters.

There is

"All the neighbor based methods in collaborative filtering are based on the use of some sort of distance measure. The most commonly used are Cosine distance and Pearson Correlation. However, both these distances assume a linear interval scale in their computations!"

Nonsense. Just write out the definitions of expectation, variance, covariance, and Pearson correlation and see that it is sufficient for the expectations of the squared random variables to be finite. There is nothing about "interval scale" in the assumptions.
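For reference, the definitions written out, as a sketch in standard notation. The symbols X and Y, standing for two real-valued random variables such as two users' ratings, are introduced here for illustration, and the variances are assumed to be nonzero so that the ratio is defined:

```latex
\begin{align*}
\operatorname{Var}(X) &= E\big[(X - E[X])^2\big] = E[X^2] - (E[X])^2, \\
\operatorname{Cov}(X, Y) &= E\big[(X - E[X])(Y - E[Y])\big] = E[XY] - E[X]\,E[Y], \\
\rho(X, Y) &= \frac{\operatorname{Cov}(X, Y)}{\sqrt{\operatorname{Var}(X)\,\operatorname{Var}(Y)}}.
\end{align*}
```

With E[X^2] and E[Y^2] finite, the Cauchy-Schwarz inequality gives E|XY| finite, so every term above is well defined. A rating confined to [1, 5] is bounded, so its second moment is automatically finite.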

But why calculate Pearson correlation? When you dig into that, you again basically just want some MSE (mean square error) convergence, which again makes no assumptions about "interval scale" data.

There is

"This is my favorite one... The most commonly accepted measure of success for recommender systems is the Root Mean Squared Error (RMSE). But wait, this measure is explicitly assuming that ratings are also interval data!"

Nonsense. There is no such assumption about MSE. The main point about MSE is just that any sequence of random variables (e.g., estimates) that converges in MSE will have a subsequence that converges almost surely. In practice, convergence in MSE is convergence almost surely, and that's the best convergence there can be. So, if your estimates are good in MSE, then essentially always in practice they are close in every sense. Nowhere in this argument is an assumption about "interval data".
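The subsequence claim follows from a standard argument, sketched here with notation that is assumed for illustration: X_n are the estimates, X is the target, and n_k is a chosen subsequence. Chebyshev's inequality turns MSE convergence into convergence in probability, and the Borel-Cantelli lemma upgrades a fast enough subsequence to almost sure convergence:

```latex
\begin{align*}
&\text{For every } \varepsilon > 0: \quad
P\big(|X_n - X| \ge \varepsilon\big) \le \frac{E\big[(X_n - X)^2\big]}{\varepsilon^2} \longrightarrow 0. \\
&\text{Choose } n_1 < n_2 < \cdots \text{ with } E\big[(X_{n_k} - X)^2\big] \le 2^{-k}. \text{ Then} \\
&\sum_{k=1}^{\infty} P\big(|X_{n_k} - X| \ge \varepsilon\big)
\le \frac{1}{\varepsilon^2} \sum_{k=1}^{\infty} 2^{-k} < \infty, \\
&\text{so by Borel-Cantelli, } |X_{n_k} - X| \ge \varepsilon \text{ for only finitely many } k \text{ almost surely,} \\
&\text{and taking } \varepsilon = 1/m,\ m = 1, 2, \ldots, \text{ gives } X_{n_k} \to X \text{ almost surely.}
\end{align*}
```

The only requirement is that the second moments involved are finite. Nothing in the argument refers to the measurement scale of the data.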

This article sounds like 'statistics' from some psycho researcher who has an obsession about interval scales and a phobia about using ordinal scale data! In particular he has high anxieties about being charged with heresy by the Statistical Religious Police! The guy needs 'special help'!

Did I mention that the article is nonsense?


Thanks @NY_USA_Hacker. Yours is a great example of why smart people should also invest in improving communication and social skills. Half of your comment is nonsense. The other half does raise interesting points that would deserve a reply if they were written in a different tone. I would be happy to go into each of the points you mention if you decide to rewrite the comment in a more constructive way.

Really, you can be positive, constructive and even happy in your life without sounding less smart by doing so.

Your response is to style, not substance.

But there isn't a lot of room to respond with substance because the article is, did I mention, nonsense.

There is a reason: Several paths led to some of the more central topics in probability and statistics. Such paths included gambling, astronomical observations, psychological testing, signal processing, control theory, quality control, 'statistical' physics, quantum mechanics, mathematical models in the social sciences, experimental design, especially in agriculture, mathematical finance, and more. In addition, there is now a very solid, polished field of probability, stochastic processes, and their statistics.

Some of these paths got lost in the swamp on their way to some reasonably clear understanding. For the solid material, so far that is rarely taught: The prerequisites need quite a lot of pure math, and then the pure math departments rarely follow through with the probability, stochastic processes, and statistics.

Early in my career, I was dropped into parts of the swamp, but later I got the rest of the pure math prerequisites and good coverage of the solid, polished material.

So, at this point I see both the swamp and the solid, polished material.

Net, the paper is from the swamp, and I responded with just a little of the solid, polished material.

For the swamp, not a lot of discussion is justified. The best response is the one I gave: The stuff from the swamp is nonsense. That may sound harsh, but it's on the center of the target.

Good luck with your life out of the swamp. Looks like you are going to need it.