by ChuckMcM 5490 days ago
No, it would not be hard. As someone who operates the backend of a web search company[1], I can tell you that if you crawl a site you can tell when the original went up and when the copies went up. At Blekko, we let you go and view it.[2] Now add to that the understanding that Google has to index your page in order to serve page-relevant content (no sense putting ads for nipple piercing on a Christian advocacy site, for example), and it follows that they know they are serving an advertisement on duplicated content. (Where 'know' here is defined to be that they have all the data they need, at the time of serving up the ad, to algorithmically identify duplicate content.) When I worked there I got a pretty thorough look at how that part of the business worked.
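
To make the "you can tell when the original went up" point concrete, here is a minimal sketch in Python. It is not Blekko's or Google's actual pipeline; `CrawlRecord`, `fingerprint`, and `find_duplicates` are hypothetical names, and exact hashing of normalized text is a simplifying assumption (a production system would more likely use near-duplicate detection). The idea is just: group crawled pages by content fingerprint, then treat the earliest first-seen page in each group as the presumed original.

```python
import hashlib
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class CrawlRecord:
    """Hypothetical crawl log entry; field names are assumptions."""
    url: str
    first_seen: datetime  # when the crawler first fetched this URL
    body: str


def fingerprint(body: str) -> str:
    # Collapse case and whitespace so trivial reformatting still matches.
    # Assumption: exact match after normalization stands in for real
    # near-duplicate detection.
    normalized = " ".join(body.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_duplicates(records):
    """Map each presumed original URL to the URLs of its later copies."""
    groups = defaultdict(list)
    for rec in records:
        groups[fingerprint(rec.body)].append(rec)

    result = {}
    for recs in groups.values():
        if len(recs) < 2:
            continue  # content seen at only one URL, nothing to flag
        recs.sort(key=lambda r: r.first_seen)
        original, *copies = recs
        result[original.url] = [c.url for c in copies]
    return result
```

The ordering is only as good as the crawl schedule: a copy that happens to be fetched before its source would be mislabeled as the original.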

They could flag the account and they could cut them off (it's in their terms of service that they can cut you off for any reason, and they have done so to people in the past), but they don't. Given recent updates in their search ranking [3] they clearly can "identify" sites where this check would be implemented, but they choose not to. I speculate they don't for the same reason Apple doesn't look too hard or deeply at the working conditions at Foxconn: willful blindness is a wonderful thing.

[1] I run operations at http://blekko.com

[2] http://blekko.com/ws/http:%2F%2Ftheoatmeal.com%2F+/domaindup...

[3] http://jeffmills.com/2011/01/29/duplicate-content-is-now-a-p...

3 comments

Sigh, I can't edit it, but yes, the jeffmills pop-under-ish thing is really annoying. Skip [3] and refer to this article from SearchEngineLand on the impact of their duplicate content detection:

http://searchengineland.com/your-sites-traffic-has-plummeted...

How do you know which is the oldest content? Are you polling web pages often enough that your spider can tell the time difference between the real content and the seconds-old, RSS-scraped copy?

Just kidding, Blekko is great, although I'm more of a DDG and Google user.

Sites can feed new content to Googlebot before publishing to a broader audience. This is commonly done in Google News... you will sometimes click on a link to an article that claims to be not yet published. It is annoying usually, but helpful in this case.
OT: Trying to close jeffmills.com resulted in some annoying results.