by progolferyo 5400 days ago
Interesting article. I don't understand the filtered review system at all. Beyond the 'he said / she said' complaints that occasionally come out, there are things about their system that simply don't make any sense unless Yelp is incompetent or slimy. For example:

- When you post a review, you as the reviewer think it stays unfiltered forever. When you revisit the page as a logged-in user and look at a business you have reviewed, your review is visible. When you log out or log in as another user, the review is filtered and hidden. At the very least, it should tell you your review is filtered. I see no reason to pretend the review is not filtered when the review is legitimate.

- When you view unfiltered results, the per page number mysteriously changes to 10 per page. I don't see any reason why this should change. Plus the results are pretty slow to load, quite a bit slower than the results for filtered reviews.

- Why do you need to enter in a captcha to view the unfiltered reviews? Why would they care if you were a bot only for the unfiltered reviews and not the normal reviews? I don't see the difference, unless they want to prevent people from writing scripts to pull in unfiltered review data. Plus the captcha is fucking horrible. Literally half the captchas I get are not readable and I need to refresh.

- The filter algorithm seems to be clearly flawed and simply catches way too many reviews that should not be filtered. For example, take this user: http://www.yelp.com/user_details?userid=tZlbsUVo-8wtnR7oMa-3... . The guy has 11 reviews, 1 1-star review, 1 2-star review and nothing out of the ordinary and yet his review about Yelp was filtered. Why? His points in the review seemed legitimate. He seems to be a normal user, not a new user and posts reviews across the board (more good reviews than bad in fact). They should either fix the algorithm or be more transparent about why reviews are filtered because I can't understand why a review like that is filtered.

4 comments

> When you view unfiltered results, the per page number mysteriously changes to 10 per page. [...] Plus the results are pretty slow to load, quite a bit slower than the results for filtered reviews.

Caching. I'm sure most unfiltered reviews are cached whereas filtered reviews are not, and reaching out past the cache can be expensive. One way to mitigate this is to reduce the number of results you pull.
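
To make that guess concrete, here is a minimal sketch of the idea, not a description of how Yelp actually serves pages. The class name, the dict standing in for a database, and the view names are all hypothetical; only the 40 and 10 page sizes come from this thread.

```python
DEFAULT_PAGE_SIZE = 40   # page size of the normal view mentioned in the thread
COLD_PAGE_SIZE = 10      # smaller page for the rarely requested view


class ReviewPager:
    """Serve the popular view from a cache and the rare view straight from the store."""

    def __init__(self, store):
        # store: hypothetical dict mapping view name -> list of reviews
        # (a stand-in for a slow database).
        self.store = store
        self.cache = {}
        self.store_reads = 0  # counts the expensive reads

    def _read(self, view, page, size):
        self.store_reads += 1
        start = page * size
        return self.store[view][start:start + size]

    def page(self, view, page):
        if view == "default":
            key = (view, page)
            if key not in self.cache:
                self.cache[key] = self._read(view, page, DEFAULT_PAGE_SIZE)
            return self.cache[key]
        # Uncached view: every request pays for a store read, so fetch fewer rows.
        return self._read(view, page, COLD_PAGE_SIZE)
```

Requesting the default view twice costs one store read, while every request for the other view costs a read of its own, which is where the smaller page size would help.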

> Why do you need to enter in a captcha to view the unfiltered reviews? Why would they care if you were a bot only for the unfiltered reviews and not the normal reviews?

If you can write a script to deduce the filtering algorithm then you can by definition write reviews that thwart it. With less data, it is harder to deduce the filtering algorithm. In other words, a captcha thwarts high-volume review fraud.

> The filter algorithm seems to be clearly flawed and simply catches way too many reviews that should not be filtered.

I think most people underestimate the difficulty of the problem. Unlike e-mail spam, which is easy for a human to spot, fake reviews are very hard for a human to spot. How can you tell, just from their writing, if a consumer was provoked into writing a positive review so that they could get a few bucks off their order? You can't. You can look at other statistical trends behind such reviews (such as a sudden wave of positive reviews), but you're only looking for side effects of the primary problem, and thus you will never achieve perfect performance from a method like this.
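
As a rough illustration of the "sudden wave of positive reviews" signal, here is a sliding-window sketch. The function name and every threshold are invented for the example, and nothing here is known to match Yelp's filter. It also shows the limitation described above: it flags a side effect (timing), not the fake review itself.

```python
from datetime import date, timedelta


def find_positive_bursts(reviews, window_days=7, threshold=5, min_stars=4):
    """Return start dates of windows holding at least `threshold` positive reviews.

    reviews: iterable of (date, stars) tuples for one business.
    All defaults are illustrative guesses, not known Yelp values.
    """
    positive_days = sorted(day for day, stars in reviews if stars >= min_stars)
    window = timedelta(days=window_days)
    bursts = []
    left = 0
    for right, day in enumerate(positive_days):
        # Advance the left edge until the window spans fewer than window_days days.
        while day - positive_days[left] >= window:
            left += 1
        if right - left + 1 >= threshold:
            start = positive_days[left]
            if not bursts or bursts[-1] != start:
                bursts.append(start)
    return bursts
```

A business with one positive review a month produces no bursts, while five positive reviews inside three days produce one. A genuinely popular new restaurant would trip the same check, which is exactly the false-positive problem.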

Yelp takes the (somewhat philosophical) viewpoint that customers who are coerced into writing a review are less genuine than they would be otherwise. I believe that this view drives a lot of their algorithm and possibly threatens its accuracy in a way that is ultimately not worth it. I think there are a number of things that Yelp could do to increase users' trust in reviews that don't involve filtering - one simple thing would be for a user's review of an Indian restaurant to show me that user's breakdown of reviews of other Indian restaurants.
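
The per-category breakdown could be as small as the sketch below. The tuple layout and the function name are assumptions made for the example; the real data model is unknown.

```python
from collections import Counter


def category_breakdown(user_reviews, category):
    """Count one user's star ratings within a single category.

    user_reviews: iterable of (business_name, category, stars) tuples.
    Returns a dict mapping each star value 1..5 to a count.
    """
    counts = Counter(stars for _, cat, stars in user_reviews if cat == category)
    return {stars: counts.get(stars, 0) for stars in range(1, 6)}
```

Shown next to a 5-star review, a breakdown like "5 stars: 2, 2 stars: 1" tells the reader whether this user hands out 5 stars to every Indian restaurant or only to a few.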

TL;DR: This is a much harder problem than it seems at first glance, partly because of the nature of the problem and partly because of how Yelp has framed it for themselves.

Disclaimer: I used to work at Yelp, but no longer do. Everyone I worked with was a stand-up guy.

I guess most of your points make sense. I just feel like Yelp does very little to be open and transparent. I get that it's very difficult. Nobody ever said it's easy to algorithmically guess review spam.

But they clearly don't want users to see unfiltered reviews. A tiny gray link below all 40 reviews, then a captcha (or two or three) and then a slow user experience before you can see the filtered reviews is lame.

I agree with you about showing other reviews of the same subject, that would be neat. I guess if I were Yelp, I would try harder to stand up for the algorithms and show more data about why they work and why we are better off having their amazing algorithms.

I had an experience a year or so ago with a friend who started a moving service in SF. A couple of months after he started the business, he noticed he had received a review on Yelp from some dude who claimed that during a moving job, the mover took a smoke break and peed all over the sofa he was moving. Not only was the story ridiculously false, but my buddy had no idea who the reviewer was. The review did however NOT get filtered, even after he responded to the review and contacted Yelp. And he was stuck with this crazy review at the top of his profile. This went on for months and it really damaged his credibility. Meanwhile, he would get positive reviews from legitimate customers who would naturally have a newer profile or whatever, and those reviews would get filtered. It just seems like Yelp should be more sophisticated. (And yes, they are 10000% better than TripAdvisor)

Thank you newhouseb! This is the most intelligent comment I've seen on the issue, by far.

(I hope this comment makes it through the filter, I swear it's not a fake... I don't even know newhouseb...)

"At the very least, it should tell you your review is filtered. I see no reason to pretend the review is not filtered when the review is legitimate."

I can explain that for you! If you're gaming the system and Yelp catches you, they don't want to tell you they've caught you, or you'll just try again. That's also why you need to enter in a captcha to view the unfiltered reviews -- they want to prevent automatic methods of confirming that posts were filtered.

Apparently this isn't common knowledge -- the OP makes the same mistake:

"Interestingly enough, you have to pass a reCaptcha when clicking on it. Weird, I don't have to do that to view unfiltered reviews. Interesting...."

EDIT: ps, the user you linked to doesn't have a name. I'm not saying his stuff should be filtered, but keep in mind when you're looking at this stuff that a million factors go into that decision, and it's probably in their customers' best interests to err on the side of filtering a review.

But there are easy methods to see if your review was filtered - you just look for it with a different account.

This only affects people who have written legitimate reviews that have been filtered. It does nothing to prevent an automated system from doing anything.

That review that was filtered contained swearing. Does Yelp filter reviews like that? I'm sure there are kids that use the site, so that's a possibility.

You need to find more cases like this. Just one data point is not enough to draw any useful conclusions.

"Why do you need to enter in a captcha to view the unfiltered reviews?"

Maybe they don't want Google to see them...

That's what robots.txt is for, not silly captchas. You put captchas for the people who ignore that file.
I think the idea is that given a large corpus of filtered and unfiltered reviews, you might be able to reverse engineer signals in the algorithm and game the system. If that's your end goal, you and the software you write are likely to ignore robots.txt directives.
Not all spiders honor robots.txt
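
For anyone wondering why robots.txt can simply be ignored: the check runs inside the crawler, so it only happens if the crawler's author decides to run it. A small sketch using Python's standard `urllib.robotparser`; the robots.txt content, the `/filtered_reviews/` path, the bot name, and the example.com URLs are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; the disallowed path is invented for this example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /filtered_reviews/
"""


def polite_crawler_may_fetch(url, user_agent="ExampleBot"):
    """What a well-behaved crawler asks itself before requesting a URL."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)
```

A polite crawler gets a "no" for anything under the disallowed path and skips it. A scraper that never calls the check is not slowed down at all, since nothing on the server enforces the file.
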
Plus captchas are just such a stupid user experience anyway. If you want to avoid the robots problem, there are plenty of ways to do it without captchas.
Like? Yelp doesn't want the filtered reviews to be accessed in an automated fashion, that's what a CAPTCHA does. What are the other options?
A quick and dirty solution could be to add something to the page using javascript after the page has loaded and only let the link work if that variable exists (and check the value of the key with the server, if you wanted to be more cautious). Not a complete solution, but a first step and invisible to the user (and a pain in the ass to a robot)
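
A sketch of the server side of that idea, under some assumptions: instead of a bare variable, the page script attaches a signed, short-lived token to the link, and the server refuses the request without a valid one. The function names, the secret, and the five-minute lifetime are all hypothetical, and a scraper that runs the page's script would still obtain the token.

```python
import hashlib
import hmac
import secrets
import time

SECRET_KEY = secrets.token_bytes(32)  # hypothetical server-side secret
TOKEN_LIFETIME_SECONDS = 300          # illustrative value


def _sign(session_id, issued_at):
    message = f"{session_id}:{issued_at}".encode()
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()


def issue_token(session_id, now=None):
    """Build the value the page script would attach to the link after load."""
    issued_at = int(time.time() if now is None else now)
    return f"{issued_at}:{_sign(session_id, issued_at)}"


def token_is_valid(session_id, token, now=None):
    """Check the token the client sends back when the link is followed."""
    current = time.time() if now is None else now
    try:
        issued_at_text, signature = token.split(":", 1)
        issued_at = int(issued_at_text)
    except ValueError:
        return False  # malformed token
    if not 0 <= current - issued_at <= TOKEN_LIFETIME_SECONDS:
        return False  # expired or dated in the future
    expected = _sign(session_id, issued_at)
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(signature.encode(), expected.encode())
```

A token issued for one session is rejected for another session and after the lifetime runs out, so a client that never loaded the page has nothing valid to send.
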
A robot scraping Yelp's deep data is going to be site-specific, and having to scrape another JavaScript variable is not much more than a slight speed-bump.
Scraping with a normal web browser is utterly trivial. Anything a computer can do, a computer can do. Hence CAPTCHAs.
Yelp does not just want to make scraping impossible, they seem to also be interested in making it harder for humans to view filtered reviews.