| HN Mirror



	by wuliwong 3503 days ago
	There is no way to determine that two submissions are duplicates just by analyzing the URLs. Two HN posts can legitimately point to different content while sharing a domain and even partially matching paths. To work dependably, the algorithm would have to examine the HTML returned by each URL. In this case even that wouldn't work, because the two pages aren't actually identical. To mark these two URLs as duplicates, I think you might need some machine learning, and even then it would only yield some level of confidence that they were indeed duplicates.
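To make the "level of confidence" idea concrete, here is a rough sketch of one simple content-based comparison: Jaccard similarity over word shingles. This is not what HN does. It is only an illustration, and it assumes the two pages have already been fetched and reduced to plain text. The names `shingles` and `jaccard` and the shingle size `k=3` are my own choices.

```python
import re

def shingles(text, k=3):
    """Return the set of k-word shingles in text (lowercased, punctuation ignored)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < k:
        # Too short for a full shingle: treat the whole text as one shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b, k=3):
    """Similarity score in [0, 1]: shared shingles divided by all shingles."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0  # two empty documents are trivially identical
    return len(sa & sb) / len(sa | sb)

# Two texts that differ only in the last word share 2 of 4 distinct shingles.
score = jaccard("the quick brown fox jumps", "the quick brown fox sleeps")
```

The result is a score, not a yes/no answer, so some threshold still has to be picked by hand or learned, which is the point about confidence above.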

1 comment

xapata 3502 days ago

Or they could just special-case GitHub repos and their README files. But what is duplicate detection if not machine learning? One can't even be sure the same URL returns the same document if it is submitted at different times.
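For what it's worth, the special case could look something like the sketch below. It maps a repo's root URL and its top-level README URL to the same key. The function name `github_repo_key` and the example URLs are made up for illustration, and the rule only handles the `blob/<branch>/README*` form, so a branch name containing a slash would not match.

```python
from urllib.parse import urlsplit

def github_repo_key(url):
    """Return 'owner/repo' for a GitHub repo root or its top-level README, else None."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host not in ("github.com", "www.github.com"):
        return None
    segs = [s for s in parts.path.split("/") if s]
    if len(segs) < 2:
        return None  # e.g. a user profile page, not a repo
    owner, repo = segs[0].lower(), segs[1].lower()
    if repo.endswith(".git"):
        repo = repo[:-4]
    rest = segs[2:]
    is_root = not rest
    # Assumed README form: /owner/repo/blob/<branch>/README<anything>
    is_readme = (
        len(rest) == 3
        and rest[0] == "blob"
        and rest[2].lower().startswith("readme")
    )
    return f"{owner}/{repo}" if (is_root or is_readme) else None

# Hypothetical submissions that would be treated as the same story:
a = github_repo_key("https://github.com/foo/bar")
b = github_repo_key("https://github.com/Foo/Bar/blob/master/README.md")
```

Two submissions whose keys are equal and not `None` would be flagged as duplicates. Everything else falls through to whatever the general check is.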