by wuliwong
3503 days ago
There is no way to determine that these are duplicate pages just by analyzing the URLs. Two different posts to HN can legitimately have different content while sharing the same domain and even partially matching paths. To work dependably, the algorithm would have to examine the HTML returned by the URLs to determine whether they are duplicate pages. In this case, even that wouldn't work, because they aren't actually duplicate pages. To mark these two URLs as duplicates, I think you might need some machine learning, and even then it would only yield some level of confidence that they were indeed duplicates.
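To make the "examine the HTML" idea concrete, here is a rough sketch of one way it could be done. It is not how HN works, and it is not machine learning. It is a plain word-shingle Jaccard comparison that I am swapping in as the simplest thing that produces a score. The function names (`visible_words`, `shingles`, `duplicate_score`) and the shingle size `k=3` are my own assumptions, and the HTML is passed in as strings so that fetching is left out.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> bodies."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def visible_words(html):
    """Return the lowercase alphanumeric words found in the page text."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return re.findall(r"[a-z0-9]+", " ".join(parser.chunks).lower())


def shingles(words, k=3):
    """Return the set of overlapping k-word sequences (hypothetical k=3)."""
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def duplicate_score(html_a, html_b, k=3):
    """Jaccard similarity of the two pages' shingle sets, from 0.0 to 1.0.

    This is a similarity score, not a yes/no answer: the caller still
    has to pick a threshold, which is where the uncertainty comes in.
    """
    a = shingles(visible_words(html_a), k)
    b = shingles(visible_words(html_b), k)
    if not a and not b:
        return 0.0  # nothing to compare, so claim no similarity
    return len(a & b) / len(a | b)
```

Two pages with the same text but different markup would score high, and two pages that merely share a domain and part of a path would score low, which is the point: the URLs alone tell you neither. Whatever threshold you choose, the result is still only a degree of confidence.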