by IndexPointer
1620 days ago
Any simple heuristic has false positives, meaning it will end up taking down legitimate sites that had repeated content for a good reason.
Say, for example, two sites quote text from the US Constitution. The second one to be crawled would be treated as spam copying the first and removed from web results. Then you'll get comments on Hacker News complaining that Google is censoring it for political reasons. And any simple heuristic is quickly reverse-engineered by SEOs, who will find a way to mask their spam as legitimate. tl;dr: it's a hard problem.
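To make the Constitution example concrete, here is a rough sketch of the kind of "simple heuristic" I mean: word shingles plus Jaccard similarity, where the first page crawled wins. This is purely illustrative and not how any real search engine does it; the function names, the 0.8 threshold, and the example URLs are all made up.

```python
import hashlib

def shingles(text, k=5):
    """Return the set of hashed k-word shingles for a page's text."""
    words = text.lower().split()
    return {
        hashlib.sha1(" ".join(words[i:i + k]).encode()).hexdigest()
        for i in range(max(len(words) - k + 1, 1))
    }

def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def naive_dedup(pages, threshold=0.8):
    """First-crawled-wins: drop any page too similar to one already kept.

    `pages` is a list of (url, text) pairs in crawl order.
    The threshold is an arbitrary assumption for this sketch.
    """
    kept = []     # (url, shingle set) for pages we keep
    removed = []  # urls flagged as "copies"
    for url, text in pages:
        s = shingles(text)
        if any(jaccard(s, t) >= threshold for _, t in kept):
            removed.append(url)
        else:
            kept.append((url, s))
    return [u for u, _ in kept], removed

# Two legitimate (hypothetical) sites quoting the same public text.
preamble = (
    "We the People of the United States, in Order to form a more "
    "perfect Union, establish Justice, insure domestic Tranquility"
)
pages = [
    ("https://site-a.example/constitution", preamble),
    ("https://site-b.example/civics", preamble),
]
kept_urls, removed_urls = naive_dedup(pages)
```

The second site gets flagged only because it was crawled second, which is exactly the false positive. And a spammer who shuffles or swaps a few words per shingle drops under the threshold, so the same heuristic is also easy to evade.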
As I have said, the reason they don't do it is not that they lack the skills and know-how.