by ChuckMcM 3990 days ago
Interesting approach. A slightly simpler one is to take the MD5 hash of each paragraph. Two paragraphs with the same hash are almost certainly identical, and two articles that share 2 or more identical paragraphs are likely dupes.

So, as a suggestion, try that algorithm on your current infrastructure and let us know how it compares to the Jaccard similarity test.
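
For anyone who wants to try it, here is a rough sketch of the paragraph-hash idea in Python. It assumes paragraphs are separated by blank lines and that collapsing whitespace before hashing is acceptable; neither detail comes from the article, and the names `paragraph_hashes` and `likely_dupe` are made up for illustration.

```python
import hashlib


def paragraph_hashes(text):
    """Return the set of MD5 hex digests of the non-empty paragraphs in text.

    Assumption: paragraphs are separated by blank lines.
    """
    hashes = set()
    for para in text.split("\n\n"):
        # Collapse runs of whitespace so trivial formatting differences
        # do not change the hash (an assumption, not part of the suggestion).
        normalized = " ".join(para.split())
        if normalized:
            hashes.add(hashlib.md5(normalized.encode("utf-8")).hexdigest())
    return hashes


def likely_dupe(article_a, article_b, min_shared=2):
    """Flag two articles as likely dupes if they share min_shared or more paragraphs."""
    shared = paragraph_hashes(article_a) & paragraph_hashes(article_b)
    return len(shared) >= min_shared
```

Because the hashes are kept in a set, a paragraph repeated inside one article only counts once toward the threshold.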

1 comment

Some blogs have a standard closing paragraph like "If you have read all of this, you may like to subscribe to my rss", or "We are always hiring at ABC, send your resume." Another problem is short captions that look like a paragraph to the HTML parser, like "Advertisement" or "XYZ Benchmark (higher is better)". One possible solution is to skip paragraphs that have fewer than, say, 150 letters.
I agree that it is quite reasonable to ignore paragraphs that are shorter than 3 sentences.
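
A minimal sketch of such a length filter, in Python, might look like the following. The 150-letter and 3-sentence thresholds are the guesses from this thread, the sentence splitter is a deliberately naive regex, requiring both thresholds at once is my own assumption, and `is_substantial` is a hypothetical name.

```python
import re


def is_substantial(paragraph, min_letters=150, min_sentences=3):
    """Return True if a paragraph is long enough to be worth hashing.

    Both thresholds are guesses from the discussion, not tuned values.
    """
    # Count alphabetic characters only, so digits and punctuation are ignored.
    letters = sum(ch.isalpha() for ch in paragraph)
    # Naive sentence split on runs of ., ! or ?; abbreviations will be miscounted.
    sentences = [s for s in re.split(r"[.!?]+", paragraph) if s.strip()]
    return letters >= min_letters and len(sentences) >= min_sentences
```

This drops captions such as "Advertisement", but a long boilerplate sign-off could still pass, so it only addresses the short-caption half of the problem.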