by tropin 5490 days ago
I think the problem is it would be really hard for the Google web spiders to tell which is the real content and which the copy without human interaction.

No, it would not be hard. As someone who operates the backend of a web search company[1] I can tell you that if you crawl a site you can tell when the original went up and when the copies went up. At Blekko, we let you go and view it.[2] Now if you add to that the understanding that Google has to index your page in order to serve page-relevant content (no sense putting ads for nipple piercing on a Christian advocacy site, for example), they know they are serving an advertisement on duplicated content. (Where 'know' here is defined to mean they have all the data they need, at the time of serving up the ad, to algorithmically identify duplicate content.) When I worked there I got a pretty thorough look at how that part of the business worked.

They could flag the account and they could cut them off (it's in their terms of service that they can cut you off for any reason, and they have done so to people in the past), but they don't. Given recent updates in their search ranking [3] they clearly can "identify" sites where this check would be implemented, but they choose not to. I speculate they don't for the same reason Apple doesn't look too hard or deeply at the working conditions at Foxconn: willful blindness is a wonderful thing.

[1] I run operations at http://blekko.com

[2] http://blekko.com/ws/http:%2F%2Ftheoatmeal.com%2F+/domaindup...

[3] http://jeffmills.com/2011/01/29/duplicate-content-is-now-a-p...

Sigh, I can't edit it, but yes the jeffmills pop underish thing is really annoying. Skip [3] and refer to this article from SearchEngineLand on the impact of their duplicate content detection:

http://searchengineland.com/your-sites-traffic-has-plummeted...

How do you know which is the oldest content? Are you polling web pages often enough that your spider can tell the time difference between the real content and the seconds-old RSS-scraped copy?

Just kidding, Blekko is great, although I'm more of a DDG and Google user.

Sites can feed new content to Googlebot before publishing to a broader audience. This is commonly done in Google News... you will sometimes click on a link to an article that claims to be not yet published. It is annoying usually, but helpful in this case.
OT: Trying to close jeffmills.com resulted in some annoying results.
I don't know about that. For one-offs, yes, but if you noticed serial content appearing on sites a, b, c, and d, then you can just poll them as a set periodically until you see which site consistently originates it.
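To make the "poll them as a set" idea concrete, here is a rough sketch, not anything a real search engine is known to run. It assumes the crawler has already logged, for each shared item, the time it first saw that item on each site; the function name, the tuple layout, and the example hostnames are all made up for illustration.

```python
from collections import Counter

def likely_originator(observations):
    """Count how often each site was the first place an item was seen.

    observations: iterable of (site, item_id, first_seen) tuples, where
    first_seen is the crawler's own timestamp (hypothetical log format).
    """
    earliest = {}  # item_id -> (first_seen, site) of the earliest sighting
    for site, item_id, first_seen in observations:
        best = earliest.get(item_id)
        if best is None or first_seen < best[0]:
            earliest[item_id] = (first_seen, site)
    # One "win" per item for the site where it showed up first.
    return Counter(site for _, site in earliest.values())

# Hypothetical crawl log: timestamps are seconds on the crawler's clock.
obs = [
    ("a.example", "post-1", 100), ("b.example", "post-1", 160),
    ("a.example", "post-2", 300), ("c.example", "post-2", 290),
    ("a.example", "post-3", 500), ("b.example", "post-3", 530),
]
wins = likely_originator(obs)
print(wins.most_common(1))  # [('a.example', 2)]
```

A single item can go the wrong way (post-2 above), which is the one-off problem; the tally only becomes meaningful once the same set of sites has been watched over many items.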
http://news.ycombinator.com/item?id=2541853

This seems to be about exactly this problem.

Why not take the oldest as the original and leave a little form for authors to tell Google when they move their content?
Because you don't know which one is the oldest, only which one was first retrieved by the spider.
Also, I can return whatever I want in the HTTP header.
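A small sketch of why those two points matter, assuming a crawler that records its own fetch time next to whatever the server claims in Last-Modified. The function name and the record layout are hypothetical; the only real call used is `email.utils.parsedate_to_datetime` from the Python standard library.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def evidence_of_age(headers, first_fetched_at):
    """Keep the server's claim and the crawler's observation separate.

    headers: dict of response headers (hypothetical, already fetched).
    first_fetched_at: when our own crawler first retrieved the page.
    """
    claimed = None
    raw = headers.get("Last-Modified")
    if raw:
        try:
            claimed = parsedate_to_datetime(raw)
        except (TypeError, ValueError):
            claimed = None  # unparseable header: treat as no claim
    return {"claimed_by_server": claimed,
            "observed_by_crawler": first_fetched_at}

# A scraper can claim any date it likes; only the fetch time is ours.
ev = evidence_of_age(
    {"Last-Modified": "Mon, 01 Jan 2001 00:00:00 GMT"},
    datetime(2011, 5, 12, 9, 30, tzinfo=timezone.utc),
)
print(ev["claimed_by_server"] < ev["observed_by_crawler"])  # True
```

The claimed date is trivially forgeable, and the observed date only says when the spider got there, so neither one alone proves which copy is the original.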