| HN Mirror


by PaulHoule 3887 days ago

People have been complaining about this problem for a LONG time.

Duplicate removal is essential for making a web search engine that works. For instance, together with a CS research group, I built a search engine for a major university library that had more than 80 web sites. We found huge amounts of duplicate content produced by various mechanisms (for instance, multiple people posted the same stuff to the web). If your ranking is content-based, all of the duplicate documents are going to rank the same and form a "plug" that excludes other documents.
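To make the idea concrete, here is a minimal sketch of duplicate removal, not the method used in the library project described above. It assumes exact copies can be caught by hashing normalized text and near copies by Jaccard similarity over word shingles; the function names, the shingle size `k`, and the `threshold` value are all hypothetical choices for illustration.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting changes do not matter."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def fingerprint(text: str) -> str:
    """Hash of the normalized text; identical fingerprints mean exact duplicates."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def shingles(text: str, k: int = 3) -> set:
    """Set of overlapping k-word sequences (k=3 is an assumed, illustrative value)."""
    words = normalize(text).split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def dedupe(docs: dict, threshold: float = 0.8) -> list:
    """Return ids of the documents kept; the first copy seen wins.

    The threshold is an assumption, and every new document is compared
    against every kept one, which is quadratic in the worst case.
    """
    kept = []            # list of (doc_id, shingle set)
    seen_hashes = set()  # fingerprints of kept documents
    for doc_id, text in docs.items():
        h = fingerprint(text)
        if h in seen_hashes:
            continue  # exact duplicate after normalization
        sh = shingles(text)
        if any(jaccard(sh, kept_sh) >= threshold for _, kept_sh in kept):
            continue  # near duplicate of something already kept
        seen_hashes.add(h)
        kept.append((doc_id, sh))
    return [doc_id for doc_id, _ in kept]


pages = {
    "a": "Library hours are 9 to 5 on weekdays.",
    "b": "library  hours are 9 to 5 on weekdays.",
    "c": "Interlibrary loan requests take two weeks.",
}
print(dedupe(pages))
```

With duplicates collapsed like this, a content-based ranker sees one candidate per cluster instead of a "plug" of identical scores. The pairwise comparison is only workable for small collections; at web scale people typically use sketching schemes such as MinHash or SimHash instead.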

It has long (post 2006) been a common story that "I wrote a blog post but somebody else ranks for it." For instance, I made a blog post that got a huge amount of traffic back in the day, but if you search for it right now you find a presentation from some fresher at Oracle that is based on those ideas.

There are many factors that make this hard to control, including: (1) for one "real" origin there are probably ten or a hundred fakes, so if you are picking at random you strike out; it is not enough to outrank one fake, you have to outrank all of them, (2) freshness: copies are fresher than the original, and they can also be updated years later, (3) the bad guys think a lot more seriously about indexation, PageRank, and other variables they control than most content creators do.