Hacker News
by leephillips 4490 days ago
"Google is finally doing something about scraping"

I hope this is genuine and not a disingenuous diversion on Google's part. The fact that the Huffington Post still ranks very high for trendy searches makes me wonder.

As usual, follow the money: the scraping sites exist to make money, often through Google's advertising, and Google gets a cut. The original content is often on sites with no advertising or real traffic, from which Google earns nothing.

EDIT: To expand on this: Google-search for any hot topic in the news, say the name of some misbehaving pop star. See the HuffPo result near the top of the page. Look down to see several results from real newspapers. This is where the original content can be found. Most of these newspapers are about to die because they're not making any money. HuffPo investors are filthy rich because they're gaming the search engines to profit from copy-and-paste.

ANOTHER EDIT: I apologize for my characterization of the Huffington Post. I was describing, accurately, the nature of that site as it was the last time I visited it some time before its purchase by AOL three years ago. The HuffPo I see today is utterly transformed. They use wire services, do plenty of their own reporting, and many of the links on the front page go directly to other news sites. They are no longer a copy-and-paste site.

8 comments

Huffington Post isn't a scraper site. Aside from the original content they produce, they republish blog posts with permission from the authors. If you have an example of Huffington Post literally cut-and-pasting content from someone without attribution, please share.

I also assume that by "HuffPo investors" you mean AOL? Huffington Post is a wholly owned subsidiary.

(Disclosure: I consult for Huffington Post)

You are right. Please see my second edit in my comment.

"Google-search for any hot topic in the news, say the name of some misbehaving pop star. See the HuffPo result near the top of the page. Look down to see several results from real newspapers."

Many newspapers get a lot of their content from syndication services like Reuters. You may be seeing similar content because lazy editorial assistants just copied out a Reuters story verbatim, slapped a pic on it and put it up at multiple organisations, not because HuffPo is scraping other sites. Do you have an example of this sort of thing you can point to? It'd be interesting to trace the origin of the content.

That would be a pretty blatant copyright violation. Can you provide an example to substantiate the claim?

I covered this a few months back, when it all went down between the Verge and HuffPo, including how our social search engine's algorithm accounted for this while Google's did not.

http://theenginuity.com/blog/how-a-copied-excerpt-of-a-story...

You are correct, and the parent post before you is also correct.

Google's algorithm put a great deal of value on domain names, which provided a strong incentive for owners of a strong domain name to pump out low-quality content. Low-quality content can be the example you give, it can be "re-authoring" someone else's article, or it can be blatant copy and paste, which is generally avoided because it is so obvious.

When a media property, such as the Huffington Post, pumps out this volume of low-quality content, the advertising revenue can subsidize the cost of paying for original journalism.

On another note, take a look at The Daily Mail. They pump out timely news pieces so quickly that they are covered in typos and sometimes can't even keep left and right straight in photo captions.

There's a big difference between writing an article based on another article you read, and web scrapers.

Not that I particularly feel like defending the Huffington Post, but they're not a web scraper.

This is not Google acting on the community's behalf. It's Google doing a little CYA. It's one of those situations where Google's PR is trying to quell problems that ding their bottom line of advertising revenue. This mirrors the FB "official" advertising vs. "bot" advertising through things like Fiverr.com. It's still a sham on both sides, rather than either of them really making a big stink about it.

Is their copy and paste verbatim?

HuffPo definitely has original reporters. My friend is one of them:

http://www.huffingtonpost.com/betsy-isaacson/

She writes good articles for the general public about tech and topics like net neutrality and Aaron Swartz.