| This article is a great example of why 'computer science' people should not assume that they understand statistics, or even much about data manipulation. It is also an example of why computer people should be careful about where they learn their statistics! The article is awash in hand wringing about "interval scale" and "ordinal scale" data without being at all clear on just why someone should care, and for all the rest of the article they should not care.

So, the article has: "For ordinal data, one should use non-parametric statistical tests which do not assume a normal distribution of the data." Mostly nonsense. In statistical testing, the normal distribution arises mostly just via the central limit theorem, which has quite meager assumptions (independent, identically distributed samples with finite variance) trivially satisfied by "Likert" scale data, since the data are bounded.

Then there is: "Furthermore, because of this it makes no sense to report means of likert scale data--you should report the mode." Nonsense: The law of large numbers has especially meager assumptions, also trivially satisfied by Likert scale data. If you want to estimate expectation, then definitely use the mean and not the mode. Beyond the law of large numbers, there is also the classic Paul R. Halmos, "The Theory of Unbiased Estimation", 'Annals of Mathematical Statistics', Volume 17, Number 1, pages 34-43, 1946, which makes clear that the sample mean is the most accurate way to estimate expectation, in the sense of minimum variance among unbiased estimators. If you want to use the mode for something, then say what the heck you want to use it for and then justify using the mode as the estimator.

There is: "In order to defend that ratings can be treated as interval, we should have some validation that the distance between different ratings is approximately equal." Nonsense. Instead, you get a 'rating', say, an integer in [1, 5]. Now you have it. Use it. As for "validation that the distance between different ratings is approximately equal", why bother? Besides, "the distance" is undefined here!
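To make the law of large numbers point concrete, here is a minimal sketch in Python. The rating distribution and its weights are made up purely for illustration, and the function names are hypothetical. It draws integer ratings in [1, 5] and compares the sample mean and the sample mode against the true expectation.

```python
import random
from statistics import mean, mode

# Hypothetical rating distribution over {1, ..., 5}; the weights are
# invented for illustration and are not taken from the article.
RATINGS = [1, 2, 3, 4, 5]
WEIGHTS = [0.10, 0.15, 0.20, 0.25, 0.30]


def true_expectation():
    """Expectation of a single rating under the assumed distribution."""
    return sum(r * w for r, w in zip(RATINGS, WEIGHTS))


def sample_ratings(n, seed=0):
    """Draw n independent ratings from the assumed distribution."""
    rng = random.Random(seed)
    return rng.choices(RATINGS, weights=WEIGHTS, k=n)


if __name__ == "__main__":
    mu = true_expectation()
    for n in (100, 10_000, 100_000):
        xs = sample_ratings(n)
        # The sample mean should settle near mu as n grows; the mode
        # just reports the most frequent rating, whatever mu is.
        print(f"n={n:>7}  mean={mean(xs):.4f}  mode={mode(xs)}  E[X]={mu:.4f}")
```

With these assumed weights the sample mean should move toward the expectation as the sample grows, while the mode stays at the most frequent rating and says nothing about the expectation.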
For the "This is a clear indication that users perceive that the distance between a 2 and a 3 is much lower than between a 4 and a 5.", the writer is just fishing in muddy waters.

There is: "All the neighbor based methods in collaborative filtering are based on the use of some sort of distance measure. The most commonly used are Cosine distance and Pearson Correlation. However, both these distances assume a linear interval scale in their computations!" Nonsense. Just write out the definitions of expectation, variance, covariance, and Pearson correlation and see that it is sufficient for the squared random variables to have finite expectation. There is nothing about "interval scale" in the assumptions. But why calculate Pearson correlation? When you dig into that, again, you basically just want some MSE (mean square error) convergence, which again makes no assumptions about "interval scale" data.

There is: "This is my favorite one... The most commonly accepted measure of success for recommender systems is the Root Mean Squared Error (RMSE). But wait, this measure is explicitly assuming that ratings are also interval data!" Nonsense. There is no such assumption about MSE. The main point about MSE is just that any sequence of random variables (e.g., estimates) that converges in MSE will have a subsequence that converges almost surely. In practice, convergence in MSE is convergence almost surely, and that's the best convergence there can be. So, if your estimates are good in MSE, then essentially always in practice they are close in every sense. Nowhere in this argument is an assumption about "interval data".

This article sounds like 'statistics' from some psycho researcher who has an obsession about interval scales and a phobia about using ordinal scale data! In particular he has high anxieties about being charged with heresy by the Statistical Religious Police! The guy needs 'special help'! Did I mention that the article is nonsense? |
Really, you can be positive, constructive and even happy in your life without sounding less smart by doing so.