by blahblahblah
5160 days ago
In addition to the inter-rater reliability issue, there are also a lot of unanswered questions about the statistical distributions involved. The results are reported as population means, but without information about the underlying distribution of the results, it's unclear whether the mean is a meaningful measure of central tendency for the data, or how much overlap there was between the distributions. How did the mean compare with the median and mode? What were the standard deviations? The interquartile ranges? They're using a visual analog scale for the ratings, which is reasonable, but it seems to have simply been assumed that the data can be treated as interval data for the analysis, and the validity of that assumption hasn't been established. If I were doing the analysis, I'd have been inclined to bin the data and report the results as odds ratios with 95% confidence intervals (e.g. people wearing glasses have N times the odds (95% CI: lower to upper) of being regarded as "smart" compared with those without glasses, where "smart" is defined as a score >= some reasonable threshold on the "smartness" axis).
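To make the binning idea concrete, here is a rough sketch of how it could be computed. Everything in it is an assumption for illustration: the scores are made up, the threshold of 60 is arbitrary, `odds_ratio_ci` is a hypothetical helper name, and the interval is the standard Wald interval on the log odds ratio, which treats every rating as independent (probably not true if the same raters scored many photos).

```python
import math

def odds_ratio_ci(scores_a, scores_b, threshold, z=1.96):
    """Bin scores at `threshold`; return (odds ratio, lower, upper).

    Uses the Wald interval on the log odds ratio; z=1.96 gives ~95%.
    """
    a = sum(s >= threshold for s in scores_a)  # group A at or above threshold
    b = len(scores_a) - a                      # group A below threshold
    c = sum(s >= threshold for s in scores_b)  # group B at or above threshold
    d = len(scores_b) - c                      # group B below threshold
    if 0 in (a, b, c, d):
        # Haldane-Anscombe correction so an empty cell doesn't divide by zero
        a, b, c, d = (x + 0.5 for x in (a, b, c, d))
    odds_ratio = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # std. error of log(OR)
    lower = math.exp(math.log(odds_ratio) - z * se)
    upper = math.exp(math.log(odds_ratio) + z * se)
    return odds_ratio, lower, upper

# Made-up visual analog scale scores (0-100), NOT data from the study
glasses = [72, 65, 80, 55, 40, 90, 68, 30]
no_glasses = [50, 45, 70, 35, 62, 20, 58, 41]

odds_ratio, lower, upper = odds_ratio_ci(glasses, no_glasses, threshold=60)
print(f"OR = {odds_ratio:.2f}, 95% CI {lower:.2f} to {upper:.2f}")
```

With samples this small the interval is very wide and straddles 1, which is exactly the kind of thing a bare comparison of means hides.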
|
That's especially problematic since user-generated ratings are ordinal, not interval, data. Since the idea of an interval between points in ordinal data is essentially meaningless, the summary statistics you mentioned are not meaningful either.
It's one thing for Amazon to come up with a mean user rating to give you a sense of how much people like something, but it's not a valid method of comparing the data we have here, especially when the differences are so small.
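A toy illustration of why the interval assumption matters, using made-up 1-5 ratings and two hypothetical codings of the same ordered categories. Both codings preserve the order of the categories, which is all that ordinal data guarantees, yet they disagree about which group has the higher mean.

```python
from statistics import mean

# Made-up ordinal ratings on a 1-5 scale, NOT data from the study
group_a = [1, 1, 5, 5, 5]
group_b = [3, 3, 4, 4, 4]

# Two order-preserving ways to assign numbers to the same five categories.
# Ordinal data says nothing about which spacing is the right one.
evenly_spaced = {1: 1, 2: 2, 3: 3, 4: 4, 5: 5}
stretched_top = {1: 1, 2: 2, 3: 3, 4: 4, 5: 10}

def recode(ratings, coding):
    """Map each ordinal rating to its numeric code."""
    return [coding[r] for r in ratings]

for name, coding in [("evenly spaced", evenly_spaced),
                     ("stretched top", stretched_top)]:
    mean_a = mean(recode(group_a, coding))
    mean_b = mean(recode(group_b, coding))
    winner = "A" if mean_a > mean_b else "B"
    print(f"{name}: mean A = {mean_a:.1f}, mean B = {mean_b:.1f}, higher = {winner}")
```

When the gap between group means is small, a conclusion that flips under an equally defensible choice of spacing isn't telling you much.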