by zolloie 3221 days ago
I do research in this area and have many reactions to a lot of topics being brought up. I read this piece when it first was written and didn't think to look at the posting on HN until now.

The problem with dichotomous ratings (binary, thumbs up-down) is that they lose a lot of meaningful information without eliminating the problems you're referencing.

That is, the same problems apply to dichotomous ratings, in that people still have tendencies to use the rating scale differently. Some tend to give thumbs up a lot, others down, and people interpret what's good or bad differently. People who are ambivalent split the difference differently.

On top of that, by forcing dichotomous decisions you lose the valid variance in the moderate range and actually amplify a lot of these differences in use of the response scale, because now you've elevated these response style differences to the same level as the "meaningful part" of the response. E.g., maybe one person tends to rate things more negatively than another person, so they give a 4 and a 5 respectively. When you dichotomize, that can become a 1 and a 2 (thumbs down and thumbs up): a one-point gap on the finer scale now spans the entire scale.
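To put numbers on the 4-versus-5 example above, here is a minimal sketch. The 7-point scale and the cutoff of 5 are assumptions made for illustration (the example does not say which scale the 4 and 5 come from), and both function names are hypothetical.

```python
def gap_as_share_of_scale(a, b, low, high):
    """Absolute gap between two ratings as a fraction of the scale's width."""
    return abs(a - b) / (high - low)


def dichotomize(rating, cutoff):
    """Map a rating to 2 (thumbs up) if it is at or above cutoff, else 1 (thumbs down)."""
    return 2 if rating >= cutoff else 1


# Assumed setup: two raters with the same underlying opinion, one habitually
# a point harsher, rating on a 1-7 scale with "up" meaning 5 or more.
harsh, lenient = 4, 5

fine_gap = gap_as_share_of_scale(harsh, lenient, low=1, high=7)
binary_gap = gap_as_share_of_scale(
    dichotomize(harsh, cutoff=5), dichotomize(lenient, cutoff=5), low=1, high=2
)
```

Under these assumptions the style difference is one sixth of the scale before dichotomizing and the whole scale afterwards. Two raters who both sit well away from the cutoff would of course land on the same side, so this only shows what happens near the boundary.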

The question is whether, on balance, the variance associated with irrelevant response scale use is greater than the meaningful variance, and generally speaking studies show the meaningful variance is bigger. In general, you see a small but significant improvement in rating quality going from 2 to 3 response options, and from 3 to 4, and then you get diminishing returns after 4-6 options.
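A toy calculation gives some intuition for the diminishing returns. It is not a reproduction of any study: it assumes a hypothetical latent opinion spread evenly over [0, 1), cuts it into k equal-width categories, and measures how well the category tracks the latent value. It ignores response styles entirely, so it only covers the pure coarsening part of the argument.

```python
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def discretize(latent, k):
    """Map a latent value in [0, 1) to a category 1..k using equal-width bins."""
    return min(int(latent * k), k - 1) + 1


# Assumed latent opinions: an even grid on [0, 1), so the result is deterministic.
latent = [(i + 0.5) / 1000 for i in range(1000)]

# Correlation between the latent value and its k-category rating.
retained = {
    k: pearson(latent, [discretize(x, k) for x in latent])
    for k in (2, 3, 4, 5, 7, 10)
}
```

In this setup the correlation is roughly 0.87 for two categories and climbs with each added category, but each step adds less than the one before it.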

Also, people really don't like being forced to resolve their ambivalence into up or down, so at the very least having a middle option is better (unless you want to lose ratings).

It's fairly straightforward to adjust for rating style differences if you have a bunch of ratings from an individual on a bunch of things whose rating properties are fairly well-known. Amazon could do this if they wanted to, and I think Rotten Tomatoes might do something like this already.
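One simple version of such an adjustment is to standardize each rater's scores against that rater's own mean and spread. The sketch below is that generic within-rater z-score, not a description of what Amazon or Rotten Tomatoes actually do. The function name and the data are made up, and it assumes every rater has rated a comparable set of items.

```python
from statistics import mean, pstdev


def standardize_by_rater(ratings):
    """Convert each rater's scores to z-scores relative to that rater's own history.

    ratings: dict mapping rater -> dict mapping item -> raw score.
    Returns a dict of the same shape holding standardized scores.
    """
    adjusted = {}
    for rater, scores in ratings.items():
        values = list(scores.values())
        mu = mean(values)
        sigma = pstdev(values)
        adjusted[rater] = {
            # A rater who gives everything the same score carries no signal here.
            item: ((score - mu) / sigma if sigma > 0 else 0.0)
            for item, score in scores.items()
        }
    return adjusted


# Hypothetical data: same ordering of items, but one rater runs two points harsher.
ratings = {
    "harsh_rater": {"item_a": 1, "item_b": 2, "item_c": 3},
    "lenient_rater": {"item_a": 3, "item_b": 4, "item_c": 5},
}
adjusted = standardize_by_rater(ratings)
```

After the adjustment the two raters agree on every item, because the constant two-point offset was purely a style difference. This only works when raters have seen similar items. If the harsh rater simply bought worse products, the same arithmetic would erase a real difference.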

RT, in fact, is kind of a bad example, because their situation is so different from typical product ratings: you have a small sample of experts who each rate a lot of things. They are also aggregating things that themselves are not standardized. Their use of the tomatometer stems in part from having to aggregate a wild variety of things, as if everyone on Amazon used a different rating scale, or no rating scale at all. Note too that there's then a "filtering" process on RT's end. Finally, I also feel obliged to note that they do have ratings and not just the tomatometer, which I've started paying attention to after realizing that things like Citizen Kane show up with the same tomatometer score as Get Out, a fine movie but not the same.

The game theory angle is interesting to think about. It's something I don't usually deal with, because in the situations I'm used to, the raters don't have access to other raters' ratings. Hiding ratings is one solution, but impractical. A sort of meta-rating is another, a lot like Amazon's "helpfulness" ratings. It's imperfect but probably does well in adjusting for game theory-type phenomena, like retaliatory rating.
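As a rough illustration of how a meta-rating could feed back into the score, the sketch below weights each review by a smoothed share of "helpful" votes. This is an assumed scheme for illustration only, not Amazon's actual method. The function name, the `prior_votes` smoothing parameter, and the data are all hypothetical.

```python
def helpfulness_weighted_mean(reviews, prior_votes=1.0):
    """Average rating, weighting each review by its smoothed helpful-vote share.

    reviews: list of (rating, helpful_votes, total_votes) tuples.
    prior_votes: pseudo-votes added to each side, so a review with no votes
    gets weight 0.5 instead of 0 or an undefined ratio.
    Returns None when there are no reviews.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for rating, helpful, total in reviews:
        weight = (helpful + prior_votes) / (total + 2 * prior_votes)
        weighted_sum += weight * rating
        weight_total += weight
    return weighted_sum / weight_total if weight_total else None


# Hypothetical product: one well-regarded 5-star review and one 1-star review
# that other users overwhelmingly marked as unhelpful (e.g. a retaliatory rating).
reviews = [(5, 9, 10), (1, 0, 8)]
plain_mean = sum(rating for rating, _, _ in reviews) / len(reviews)
weighted = helpfulness_weighted_mean(reviews)
```

Here the plain mean is 3.0, while the weighted mean comes out near 4.57, because the review that others flagged as unhelpful counts for much less. The helpfulness votes can themselves be gamed, which is part of why this is imperfect.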