by dntrkv 3222 days ago
I've been saying this for years, thumbs up/down is the only system that makes sense to me.

Foursquare uses it and I've found their scores to be way more useful than Yelp's.

The biggest problem with star ratings is that they're so arbitrary. What is the difference between 3 and 3.5? What is a 1 vs a 2? 3/5 is 60%, which is almost failing when you think about it on a grading scale. If I scored something as a 3/5 I would never use that product or service again, yet many of the best restaurants are rated 3/5 on Yelp.

Unless the user has some scoring system in place for different qualities of the product or service, there is no way you can get anything resembling an accurate score.

I would never trust a user to accurately assess a score given 10 different options (.5-5) but I would be way more likely to trust a user to say either "I like this product" or "I do not like this product."

And yes, the Wirecutter approach works great; it just doesn't scale.

Counterpoint: I almost solely rely on the stars histogram in Yelp (available only on the website, not the app), completely ignoring whatever Yelp's calculated "average" is.

If a place has more 5-star ratings than 4-star ratings, it's generally amazing. If it has more 4-star ratings than 5-star ratings, it's generally fine but not something particularly special.

Just thumbs up/down would eliminate what is, to me, the single most useful aspect of Yelp.

It doesn't matter that star ratings are arbitrary -- when you average enough of them out, a clear signal overrides the noise. You can distrust any given user, while still trusting the aggregate.
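
A minimal simulation of that claim, assuming each user's rating is a hypothetical "true quality" plus random noise, rounded and clamped to 1-5 stars. The quality values, noise level, and function name are all made up for illustration:

```python
import random
from statistics import mean

def simulate_ratings(true_quality, n_users, noise_sd=1.0, seed=0):
    """Draw n_users noisy 1-5 star ratings around a hypothetical true quality."""
    rng = random.Random(seed)
    ratings = []
    for _ in range(n_users):
        noisy = true_quality + rng.gauss(0, noise_sd)
        # Round to a whole star and clamp to the 1-5 range.
        ratings.append(min(5, max(1, round(noisy))))
    return ratings

great_place = simulate_ratings(4.4, 2000, seed=1)
fine_place = simulate_ratings(3.6, 2000, seed=2)

# Any single rating may be off by a star or two, but the averages separate.
print(mean(great_place), mean(fine_place))
```

With a couple of thousand ratings, the noise in any one rating is much larger than the noise in the average, which is the sense in which the aggregate can be trusted even when no individual user is.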

(Curiously enough, I don't find any equivalent value on Amazon. On Yelp, you're really evaluating an overall experience along a whole set of dimensions, so there's a lot more to discriminate on. On Amazon, it does seem to be more of a binary evaluation -- does the product work reliably or not?)

I used to think the same thing until I realized the most accurate and consistent ratings I use on a regular basis are Rotten Tomatoes'. And they're based on strict thumbs up/down.

It ensures votes hold equal weight and that "extreme polar" voters don't skew things. It also avoids the opposite problem of voters marking everything as neutral unless it's horrible or incredible.

RT also handles high brow and low brow well. You get less voting of "eh I didn't love it, but it's sophisticated so I'll give it an extra star."

I'm sold on simple up/down.

Rotten Tomatoes is good at predicting a movie I (or others) like, but not really at "ranking". Zootopia, one of their top movies of 2016 with a 98% rating, is a good movie, but one I'm unlikely to pursue again. The Godfather (with a 99% rating) is a movie I will pick up on Blu-ray and revisit many times. It's far more than 1% better than Zootopia.

So RT is good at predicting "should I watch this movie I haven't watched before", but bad at predicting more sophisticated habits or preferences. I wouldn't buy the Blu-ray off an RT prediction, but I would rent.

So it becomes a question of what are you trying to accomplish? For some issues up/down is a good way to solve a problem, for others it isn't.

Rotten Tomatoes actually has both ratings, meaning they recognize the limitation you're referring to. On the other one, Zootopia has 8.1/10 and The Godfather has 9.2/10, showing that difference in quality.
Also, you just aren't the demographic for Zootopia. If you have kids, then it probably is worth buying and they will watch it many times. There are so many genres of films that it's best to compare within a single genre and not between genres.
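
To illustrate how a percent-positive score and an average score can disagree, here is a small sketch. The review scores are invented, and the 6.0 cutoff for counting a review as positive is an assumption made for the example, not Rotten Tomatoes' documented rule:

```python
def summarize(scores, positive_cutoff=6.0):
    """Return (share of positive reviews, average score) for scores out of 10."""
    positive = sum(1 for s in scores if s >= positive_cutoff)
    return positive / len(scores), sum(scores) / len(scores)

# Hypothetical films: one that everyone mildly likes,
# and one that almost everyone loves, with a single dissent.
crowd_pleaser = [7.0] * 10
classic = [9.5] * 9 + [5.0]

print(summarize(crowd_pleaser))  # 100% positive, average 7.0
print(summarize(classic))        # 90% positive, average about 9.05
```

The thumbs-style percentage puts the crowd pleaser on top, while the average puts the classic on top, so the two numbers answer different questions.
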
> Rotten Tomatoes is good at predicting a movie I (or others) like, but not really at "ranking". Zootopia, one of their top movies of 2016 with a 98% rating, is a good movie, but one I'm unlikely to pursue again

It feels like you're mixing together two different arguments. Rotten Tomatoes is good at predicting whether someone will like a movie. What is "ranking"? That is a very undefined concept. Ranking of what? It's clearly not ranking by likelihood of a person liking a movie, because Rotten Tomatoes already does that.

Later you mention likelihood of repeat watchings of a movie. Rotten Tomatoes records a thumbs up or down based on whether someone liked a movie; as a result it produces a metric of how likely someone is to like a movie. If, instead of asking "Did you like this movie?" immediately after a viewing, Rotten Tomatoes asked "Would you watch this movie again?", it would produce an indicator of re-watchability.

Up/down doesn't matter - it's the question that's being asked.

Note the caveat: RT obviously doesn't actually ask critics these questions; it reads their reviews and interprets them as answering those questions.

In my experience, I find my favorite movies via glowing reviews. Rotten Tomatoes completely obscures this view: if all the reviewers kind of like a film, it'll get 100%, whereas polarizing films always suffer. I'll take "Kids" over "Star Wars" any day for a better movie. Why? I'm gonna see Star Wars because I want to, not because I expect a meaningful aesthetic. But Rotten Tomatoes takes the opposite tack, pushing me towards crowd favorites rather than what I might rate highly.

Really this comes down to how terrible one-dimensional comparisons are: they only measure popularity, which is a terrible filter for quality.

I used to religiously research movies on RT - with a lot of success in my mind. With the user rating, the critic rating, and the "top" critic rating, you can infer a surprising amount about who is going to like any given film, and you learn over time where you fall on the critic/top critic/audience graph.

Recently, however, it seems like more (imo undeserving) movies that are "just ok" - like decent, but nothing special, romantic comedies and big blockbusters - are scoring above 90%. I might be being curmudgeonly about it, but I've nearly stopped checking it because it feels like there's no information there. My theory is that this started happening once Roger Ebert died... without such a leader in the field, no one is willing to say they didn't like a film unless it's obviously very bad.

I pay a lot of attention to histograms when there are many high-rated options for the same Amazon product type. A histogram that drops sharply from its number of 5-star reviews to almost nothing at the other end marks the product you want (ignoring fake reviews for the sake of this conversation).

Amassing a bunch of 4- and 5-star ratings is easy, but leaving nothing for even the most habitual of complainers to complain about? That's a monumental achievement.

For things like books, I also find that reading the middling reviews often gives the best S/N ratio. It weeds out the fanboys and weeds out those who were clearly not the audience for the book (or just have some ax to grind). You're more likely to get the "I really love this author in general but I didn't care for this book because 1.) 2.) 3.)."
Agreed. For products on Amazon above a certain star threshold (say, 3+), I evaluate given the shape of the review histogram, particularly minimizing the size of the bump down at 1-star and 2-star.
If the provider is in a position to provide a prediction, then the rating system is useful. For example, on Netflix I used the Hated It, Didn't Like It, Liked It, Really Liked It and Loved It system. When they predicted a star rating, it was pretty close. When they said "we predict you'll give this a three-star rating" (which is probably well below the "average"), that was generally a movie I liked.
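
A rough sketch of that histogram heuristic, assuming each product's reviews are available as a star-to-count mapping. The counts, product names, and helper functions are hypothetical:

```python
def mean_stars(histogram):
    """Average star rating from a {stars: count} histogram."""
    total = sum(histogram.values())
    return sum(stars * count for stars, count in histogram.items()) / total

def low_star_share(histogram):
    """Share of all reviews that are 1-star or 2-star."""
    total = sum(histogram.values())
    return (histogram.get(1, 0) + histogram.get(2, 0)) / total

# Hypothetical products with similar averages but different tails.
products = {
    "product_a": {5: 700, 4: 200, 3: 60, 2: 25, 1: 15},
    "product_b": {5: 820, 4: 40, 3: 10, 2: 20, 1: 110},
}

# Keep products above the star threshold, then prefer the smallest low-star bump.
candidates = {name: h for name, h in products.items() if mean_stars(h) >= 3}
best = min(candidates, key=lambda name: low_star_share(candidates[name]))
print(best)  # product_a: 4% low-star reviews vs 13% for product_b
```

Product B has more 5-star reviews, but its 1-star bump is what this heuristic is designed to catch.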
Which, in practice, tends to devolve to what's effectively a four-star rating of some sort: Want two hours of my life back, OK/meh, Good, Excellent

A humorous take: https://xkcd.com/1098/

But my point is that for me, it didn't. Netflix's system was good enough to take into account that people have different systems. Thus when Netflix says "we predict you'll give this 3 stars", that means it was a movie I would like. That might mean you gave it 4 stars or 2 stars or whatever, even though you liked it as much as me. They made my system the only one that matters, as long as I was consistent. Reviews in aggregate are pretty much meaningless, but a good system weighs that problem in.
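
One way a system could account for people having different scales is to standardize each rater's score against their own history and then map the result back onto the target user's scale. The sketch below is an illustration of that idea only, not Netflix's actual algorithm; the histories and function names are invented:

```python
from statistics import mean, pstdev

def to_z(rating, history):
    """Express a rating relative to the rater's own mean and spread."""
    spread = pstdev(history)
    return 0.0 if spread == 0 else (rating - mean(history)) / spread

def predict_for(user_history, others):
    """others is a list of (rating_given, that_rater's_history) pairs."""
    avg_z = mean([to_z(rating, history) for rating, history in others])
    # Translate the pooled opinion back into the target user's own scale.
    return mean(user_history) + avg_z * pstdev(user_history)

# A harsh rater whose history centers on 2 stars.
my_history = [1, 2, 3]
# Two more generous raters who both liked the movie by their own standards.
others = [(5, [3, 4, 5]), (4, [2, 3, 4])]

print(predict_for(my_history, others))  # about 3.0
```

Here a predicted 3 stars is a strong recommendation for the harsh rater, even though the other two expressed the same enthusiasm as 5 and 4 stars.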
Perhaps the issue isn't the granularity of a single dimensional rating scale, but the lack of expressive options when in reality your feeling about something is complex and multifaceted.

I've been really interested in the idea of emotive reviews as an alternative to single dimensional scores. The best idea I have at the moment is something akin to the emoji reactions you see on GitHub issues; finding a way to encode some feelings relevant to product reviews in a mechanism like that seems really intriguing to me.

I envision a panel of emoticons akin to the Facebook reaction set, but where the user can select as many as they want to quickly convey different combinations of their reactions:

    (thumbs up)      I liked this
    (heart)          I loved this
    (thumbs down)    I didn’t like this
    (smiling face)   This made me happy or satisfied
    (frowning face)  This made me sad or disappointed
    (surprised face) This made me surprised or impressed
    (angry face)     This made me angry or frustrated
Of course, it gets complicated. Did Sam U. Zerr give that product an (angry face) because they used it and didn’t like it, or because they’re offended that you would recommend it, or what?

If you’re only using icons to make recommendations to an individual user based on their own history, maybe you don’t need to infer the actual meanings; you can add all sorts of icons without any particular meaning and just make recommendations by correlation:

    (thinking face)  I’m considering this / I’m confused by or dubious of this
    (gear)           This was useful / this made me think
    (fire)           This album was great / this sauce was spicy
    (heart eyes)     I really want this / this is adorable
    ...
E.g. a recommendation for me might be “(thumbs up)(gear)(heart eyes)” because some product or content is similar, by some hidden metrics, to other things that I’ve reacted to in those ways.

Just brainstorming here. There are obviously many possible approaches in this space.

Put differently, a set of binary choices: amusing, interesting, sad, ... It's a bit difficult to come up with a good set to rate anything, but I can see it working for specific topics, like movies or games.

Or, one could just let users tag the subject and the interface would display the "weights" of the tags.

That's part of the problem here. Appropriate ratings for different types of things differ in various ways.

A simple utilitarian object? It mostly works or it doesn't.

A movie? Just to start with, there's the rating of the movie itself vs. the rating for this particular DVD. And then there are the dimensions on which the movie itself could be rated.

Or you just throw your hands up in the air and either do a thumbs up/down or a 5 star rating system on the grounds that it's better than nothing.

How about vision-based emotion recognition of viewers, with cameras in televisions and monitors? Sure, it sounds creepy, and people behave differently when observed, etc. But I believe people will forget they are "observed", so the effect diminishes after a time. Then we would have quite honest emotional feedback for movies, even for specific scenes, for advertisements, etc.
To be honest, the fact that 60% is a failing grade is a failure of the grading system, not a fact to take for granted. We've basically lost the entire dynamic range of 0-60% for no good reason.
I would actually say it's often not strict enough. In what serious field is it acceptable to only know, say, 70% of the material? Do you want to drive on a bridge designed by an engineer who only got 70% on their exams? It depends on how the test is structured, really, but unless it was one of those tests designed to bring smart people to their knees, I'd rather not.
We probably cross bridges designed by engineers who only got 70% on their exams all the time. That was a pretty satisfactory score when I was in uni.
Yeah - Exam performance from a decade or two ago is quite irrelevant for evaluating senior design engineers.

I wouldn't trust an engineering graduate who scored 100% on all their exams to design a bridge at all, whereas someone with 10+ years of relevant experience who got 60-70% on their exams would be preferable to me.

Mastery of the math isn't that relevant, given all the design standards you have to understand and comply with anyway, while all the little pragmatic solutions to real-world constraints (including how the builders work and what they need to be effective), learnt from experience and mentoring from your senior peers, are far more important.

So you're saying that the measurement of student mastery has no noise floor?
It depends on how things are graded. On a multiple-choice test with four choices per question, someone with no knowledge who guesses randomly will get ~25%. On a true-false test, someone with no knowledge gets ~50%. On a project graded by a human, or a worksheet whose answers are real numbers, someone with no knowledge and a hard-eyed grader might well get 0%. Different classes will have different proportions of these things that contribute to the overall grade (at least, I haven't heard of any requirement that classes have the same proportions of such). The simple approach of summing total points achieved over each graded item, divided by total points possible, is straightforward to calculate, but I think there's no mathematical justification for choosing one percentage-based grading scale and applying it uniformly to all classes.
>On a multiple-choice test with four choices per question, someone with no knowledge who guesses randomly will get ~25%.

That would be terrible test design. At my (German) university, most Multiple Choice tests give one point for a correct answer, minus half a point for a wrong answer. That way you expect negative points from people who think they know everything but are no better than random guessing, zero points from somebody who knows nothing, some points from someone who can always narrow it down to two choices.

I guess my point is that you can arbitrarily raise the floor with a bad grading scheme, but there's no inherent reason to do that.
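
The arithmetic behind that scheme can be checked with a short sketch. It assumes a test-taker guesses uniformly among the choices they cannot rule out; the function name and parameters are made up for the example:

```python
def expected_points_per_question(n_remaining, reward=1.0, penalty=0.5):
    """Expected points when guessing uniformly among n_remaining choices."""
    p_correct = 1 / n_remaining
    return p_correct * reward - (1 - p_correct) * penalty

# Four choices, no penalty: blind guessing earns 25% of the points.
print(expected_points_per_question(4, penalty=0.0))  # 0.25

# Four choices, minus half a point per wrong answer: blind guessing loses points.
print(expected_points_per_question(4))  # -0.125

# Narrowing it down to two choices before guessing pays off.
print(expected_points_per_question(2))  # 0.25
```

Leaving a question blank scores zero, so under the penalty scheme someone who knows nothing and says so ends up ahead of someone who guesses blindly.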

Yup. You can't assume that what you think a 3/5 means is the same as what someone else thinks a 3/5 means. You can basically assume that for thumbs up/down. And really, the question you want answered is "how likely am I to like this", and thumbs-up % of the overall population is a decent proxy for that for good reason.
What does a thumbs up mean, though? In a Netflix context, am I recommending it to others? Trying to train the recommendation for my own taste? Making sure I rewatch it if I don't remember watching it the first time? What do I do if I like a movie but it's objectively terrible? All of the above questions weigh heavily, and the end result is I just avoid binary voting systems (including voting on HN) and it becomes feature bloat with little use.

Strangely, it gets even harder with the thumbs down: there are vanishingly few things I actively wish didn't exist. Why downvote at all?

If I see an approve/disapprove button, I try to click it if it's for something I've chosen to consume (watch, buy, visit, etc). If it's a decision I'm glad I made, I thumb it up. If it's a decision I regret making, I thumb it down. People and systems will use that input in one of two ways: either optimizing stuff for my preferences, or using that data to make choices further in line with my preferences. Either way, the world is marginally more the way I like it.
Right, but what about consumers that want the rating to be meaningful? Presumably Netflix has a history of the videos you've watched in their entirety; they don't need your rating to know you consumed it.

Personally I just stop watching the moment I feel regret—the thumbs down button has no role in how I consume.

Two star ratings, though—that is meaningful, at least to me.

You may stop watching a movie on Netflix because you do not like it. You may also stop watching a movie on Netflix because you already saw it multiple times and only wanted to rewatch a few-minute snippet from it.

Without your thumbs up/down feedback it is hard for Netflix to figure out what your opinion of the movie is.

Just normalize ratings. If a product's average rating is at the 50th percentile of all ratings on the site, convert the rating to 50%. That way it carries the maximum possible information. A rating of 60% then just means it's better than 60% of similar products.
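
A minimal sketch of that normalization, assuming the site has one average rating per product and defines the percentile as the share of products with a strictly lower average. The sample averages and the function name are hypothetical:

```python
from bisect import bisect_left

def percentile_rank(value, all_values):
    """Percentage of values in all_values that are strictly below value."""
    ordered = sorted(all_values)
    return 100 * bisect_left(ordered, value) / len(ordered)

# Hypothetical per-product average star ratings across a site.
site_averages = [3.9, 4.0, 4.1, 4.2, 4.3, 4.4, 4.5, 4.6, 4.7, 4.8]

# 4.4 out of 5 looks like 88% on the raw scale,
# but only half of the products on this site score lower.
print(percentile_rank(4.4, site_averages))  # 50.0
```

On a site where nearly everything sits between 4 and 5 stars, the percentile spreads products across the full 0-100 range instead of bunching them at the top.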

School grading systems serve a completely different purpose and are a terrible comparison.

What about a thumbs up, thumbs down, and a neutral? In the case of restaurants, there are plenty of places I've eaten that I wouldn't give a thumbs up ("best place ever"), but that also don't deserve a thumbs down ("terrible").
This really depends on how thick or thin the data are.

If any given option only gets a small handful of votes, then you might see a strong bias (favourable or otherwise) where neutral would be appropriate.

In Likert scale design (where favourability options >2), there's a strong debate over even or odd choices -- should someone be able to give a "meh" rating, or do you want to force a positive or negative, if even slight.

Hence, 3, 4, 5, 6, and 7 point (typically) scales.