by crystalgiver 3990 days ago
Why can you not just use the canonicalized URL to detect dupes? That is infinitely simpler than doing text analysis.

It will work for simple cases like http vs. https and other kinds of URL normalization, but it won't work for harder cases where different URLs point to the same content under a different title.
I think it could work with the canonical tag[1] rather than the URL itself.
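
To make the difference concrete, here is a rough sketch of both ideas using only the Python standard library. The names `normalize_url`, `CanonicalFinder` and `dedupe_key` are made up for illustration, and the normalization rules (treating http and https as the same, dropping "www.", trailing slashes and fragments) are assumptions, not a standard.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit, urlunsplit


def normalize_url(url):
    """Collapse trivial URL variants (assumed rules, not a standard)."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()  # hostname drops any port
    if host.startswith("www."):
        host = host[4:]
    # Keep only non-default ports.
    if parts.port and parts.port not in (80, 443):
        host = f"{host}:{parts.port}"
    path = parts.path.rstrip("/") or "/"
    # Treat http and https as the same page and drop the fragment.
    return urlunsplit(("https", host, path, parts.query, ""))


class CanonicalFinder(HTMLParser):
    """Records the href of the first <link rel="canonical"> in a page."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attr = dict(attrs)
        rel = (attr.get("rel") or "").lower().split()
        if "canonical" in rel and attr.get("href"):
            self.canonical = attr["href"]


def dedupe_key(page_url, html):
    """Prefer the page's declared canonical URL, else fall back to its own URL."""
    finder = CanonicalFinder()
    finder.feed(html)
    target = urljoin(page_url, finder.canonical) if finder.canonical else page_url
    return normalize_url(target)
```

Two submissions would then count as dupes when their `dedupe_key` values match. This only helps when the page actually declares a canonical tag; the same article republished on another site without one would still get through.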

[1] http://googlewebmastercentral.blogspot.com.ar/2009/02/specif...