| HN Mirror


by stephen_mcd 4306 days ago

Great article.

I went through a similar process about a year ago for https://kouio.com (RSS reader). In its case I needed to coalesce closely matching RSS feeds purely for storage and performance. After trialling edit distance and various simhash implementations in Python, we ended up needing to look no further than the standard library's difflib.SequenceMatcher. I wish I had documented my findings at the time, but I recall it was the best in terms of speed and accuracy.
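For anyone curious what that kind of comparison looks like, here is a minimal sketch. It is not the kouio code: the helper names, the lowercasing step, and the 0.9 threshold are assumptions made up for illustration, and a real threshold would need tuning against your own data.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a similarity ratio in [0, 1]; 1.0 means the strings are identical."""
    # autojunk=False turns off the "popular element" heuristic that difflib
    # applies to sequences of 200 or more items, so long inputs are compared in full.
    matcher = SequenceMatcher(None, a.lower(), b.lower(), autojunk=False)
    return matcher.ratio()


def is_near_duplicate(a: str, b: str, threshold: float = 0.9) -> bool:
    """Hypothetical helper: treat two strings as the same feed above the threshold."""
    return similarity(a, b) >= threshold


if __name__ == "__main__":
    print(is_near_duplicate("Example Blog - Latest Posts", "Example Blog: Latest Posts"))
    print(is_near_duplicate("Example Blog", "Totally Different Feed"))
```

The first argument to SequenceMatcher is an optional "junk" predicate; passing None means no characters are treated as junk.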

Also you might not want to rely on str.isalnum for stripping punctuation. I made the same mistake here: https://twitter.com/stephen_mcd/status/506344236531212288
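The tweet itself is not reproduced here, so the following is only one illustration of how str.isalnum can surprise you, assuming Python 3 and Unicode input. Combining marks (such as a separate accent following a letter) are not alphanumeric, so an isalnum filter silently drops them. The function names are hypothetical, and the category-based version is just one possible alternative.

```python
import unicodedata


def naive_strip(text: str) -> str:
    """Keep only characters that str.isalnum accepts, plus whitespace."""
    return "".join(ch for ch in text if ch.isalnum() or ch.isspace())


def strip_punctuation(text: str) -> str:
    """Drop only characters in the Unicode punctuation categories (P*)."""
    # Compose letter + combining accent pairs into single code points where possible.
    text = unicodedata.normalize("NFC", text)
    return "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("P")
    )


if __name__ == "__main__":
    decomposed = "cafe\u0301, ok!"  # "e" followed by a combining acute accent
    print(naive_strip(decomposed))        # the accent is lost along with the punctuation
    print(strip_punctuation(decomposed))  # the accent survives, punctuation is removed
```

Note that the category-based version leaves symbols such as "$" or "+" in place, since those fall under the S* categories rather than P*.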

2 comments

jisaacso 4306 days ago

Thanks for the reference. It looks like SequenceMatcher is "cubic time in the worst case and quadratic time in the expected case". Did you notice any performance issues as kouio scaled?


stephen_mcd 4306 days ago

Perhaps it was more a case of accuracy for what we were looking at back then :-)

It's something we run out of band on a subset of our data, so it's never been performance critical.


logic_rabbit 4305 days ago

Hi Stephen. I am one of the creators of http://silverreader.com (RSS reader). I wish you luck with your new RSS reader launch. Building a good RSS reader is not an easy task.
