by neoneye2 1631 days ago
I have experimented with LSH (Locality Sensitive Hashing) to identify similar documents in a collection of 50k documents.

My LSH implementation is here: https://github.com/loda-lang/loda-rust/blob/develop/script/t...

Example of the 100 most similar documents: https://github.com/neoneye/loda-identify-similar-programs/bl...

LSH can produce false positives, so treat its output as candidate pairs and follow up with a more in-depth comparison.
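
For anyone who wants to see the two-stage idea in code, here is a minimal sketch in Python. It is not the implementation linked above. It assumes MinHash signatures with banding as the LSH scheme, word 3-shingles as the document features, and exact Jaccard similarity as the "more in-depth comparison". All function names and parameters (`num_hashes`, `bands`, `threshold`) are made up for illustration.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=3):
    """Return the set of overlapping k-word shingles of a text."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(shingle_set, num_hashes=64):
    """Build a MinHash signature: the minimum hash value per seeded hash."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # a different salt per hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for s in shingle_set
        ))
    return signature


def candidate_pairs(signatures, bands=16):
    """LSH step: documents sharing any full band of the signature are candidates."""
    buckets = defaultdict(list)
    for doc_id, sig in signatures.items():
        rows = len(sig) // bands
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets[key].append(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def jaccard(a, b):
    """Exact Jaccard similarity of two non-empty sets."""
    return len(a & b) / len(a | b)


def similar_documents(docs, threshold=0.5, num_hashes=64, bands=16):
    """Find candidates with LSH, then drop false positives with exact Jaccard."""
    sets = {doc_id: shingles(text) for doc_id, text in docs.items()}
    sigs = {doc_id: minhash_signature(s, num_hashes) for doc_id, s in sets.items()}
    verified = []
    for a, b in candidate_pairs(sigs, bands):
        score = jaccard(sets[a], sets[b])  # the in-depth comparison
        if score >= threshold:
            verified.append((a, b, score))
    return sorted(verified, key=lambda item: -item[2])
```

Calling `similar_documents({"a": text_a, "b": text_b, ...})` returns verified `(id, id, score)` tuples, most similar first. The point of the split is that the expensive exact comparison only runs on pairs that landed in the same bucket, not on all ~1.25 billion pairs you would get from 50k documents.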