by ChuckMcM
3990 days ago
Interesting approach. A slightly simpler one is to just take the MD5 hash of each paragraph. Two paragraphs with the same hash are likely identical, and two articles with two or more identical paragraphs are likely dupes. So, as a suggestion, try that algorithm with your current infrastructure and let us know how it compares to the Jaccard similarity test.
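
Roughly what I have in mind, as a minimal Python sketch rather than a tuned implementation. The function names, the blank-line paragraph split, the whitespace normalization, and the `min_shared` parameter are all my own assumptions for illustration; the only parts taken from the suggestion itself are "MD5 each paragraph" and "two or more matches means a likely dupe".

```python
import hashlib
import re


def paragraph_hashes(text):
    """Return the set of MD5 hex digests, one per non-empty paragraph.

    Assumption: paragraphs are separated by one or more blank lines.
    """
    hashes = set()
    for paragraph in re.split(r"\n\s*\n", text):
        # Assumption: collapse runs of whitespace so trivial formatting
        # differences do not change the hash.
        normalized = " ".join(paragraph.split())
        if normalized:
            digest = hashlib.md5(normalized.encode("utf-8")).hexdigest()
            hashes.add(digest)
    return hashes


def likely_dupe(article_a, article_b, min_shared=2):
    """Flag two articles as likely dupes if they share enough paragraph hashes."""
    shared = paragraph_hashes(article_a) & paragraph_hashes(article_b)
    return len(shared) >= min_shared
```

Since each article reduces to a set of digests, you could also index digest -> article IDs and find candidates with a lookup instead of comparing every pair. Note that a paragraph repeated within one article only counts once here, because the digests are kept in a set.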