by stephen_mcd
4306 days ago
Great article. I went through a similar process about a year ago for https://kouio.com (RSS reader). In its case I needed to coalesce closely matching RSS feeds purely for storage and performance. After trialling edit distance and various simhash implementations in Python, we ended up needing to look no further than the standard library's difflib.SequenceMatcher. I wish I had documented my findings at the time, but I recall it was the best in terms of speed and accuracy. Also, you might not want to rely on str.isalnum for stripping punctuation. I made the same mistake here: https://twitter.com/stephen_mcd/status/506344236531212288
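For anyone curious, here is a rough sketch of what the difflib approach could look like. This is not the kouio code: the function names and the 0.9 threshold are made up for illustration, and real feeds would need their own tuning. The second half shows one known way str.isalnum can bite when used as a punctuation filter (it returns False for combining marks, so accents get dropped along with punctuation). That may or may not be the exact problem described in the linked tweet.

```python
from difflib import SequenceMatcher
import unicodedata


def similarity(a: str, b: str) -> float:
    """Return a score in [0, 1]; 1.0 means the strings are identical."""
    # The first argument is the "junk" predicate; None disables it.
    return SequenceMatcher(None, a, b).ratio()


def is_near_duplicate(a: str, b: str, threshold: float = 0.9) -> bool:
    """Hypothetical helper: treat two strings as the same item above a cutoff.

    The 0.9 default is an arbitrary illustrative value, not a recommendation.
    """
    return similarity(a.lower(), b.lower()) >= threshold


def strip_with_isalnum(text: str) -> str:
    """Naive filter: keep only alphanumeric characters and whitespace."""
    return "".join(ch for ch in text if ch.isalnum() or ch.isspace())


def strip_punctuation(text: str) -> str:
    """Drop only characters whose Unicode category is punctuation (P*)."""
    return "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("P")
    )


if __name__ == "__main__":
    print(similarity("abcd", "bcde"))
    print(is_near_duplicate("Example Feed", "example feed"))

    # "e" followed by U+0301 (combining acute accent), then "!"
    decomposed = "e\u0301!"
    print(repr(strip_with_isalnum(decomposed)))  # accent is lost
    print(repr(strip_punctuation(decomposed)))   # accent is kept
```

The category-based filter only removes what Unicode classifies as punctuation, so combining marks survive. Whether symbols (category S*) should also go is a separate decision that depends on the data.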