by gpderetta 1822 days ago
A lifetime ago I used it on a document clustering system. It is not very good as a general similarity function, but it is excellent for quickly finding near-duplicates in linear time. About half of the documents in the set I was working on (news articles) were duplicates, so an early removal pass would speed up the actual clustering pass.

My recollection is a bit fuzzy, but I remember specifically using Min-Hash together with random projection.
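For anyone who wants to see the shape of such a near-duplicate pass, here is a minimal sketch of MinHash with banding. It is not the system described above: the word-shingle size, the 64 hash functions, the 16 bands, and every function name are assumptions chosen for illustration, and the random projection step is left out because the comment gives no details about it.

```python
import hashlib
from collections import defaultdict


def shingles(text, k=3):
    """Return the set of k-word shingles in a document (k=3 is an assumption)."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def minhash_signature(items, num_hashes=64):
    """For each salted hash function, keep the minimum hash value over all items."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # a different salt acts as a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode(), digest_size=8, salt=salt).digest(), "big"
            )
            for item in items
        ))
    return signature


def estimate_similarity(sig_a, sig_b):
    """The fraction of matching positions estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def candidate_pairs(docs, num_hashes=64, bands=16):
    """Bucket documents by bands of their signature; documents sharing any bucket
    become candidate near-duplicates."""
    rows = num_hashes // bands
    buckets = defaultdict(list)
    for doc_id, text in docs.items():
        sig = minhash_signature(shingles(text), num_hashes)
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].append(doc_id)
    pairs = set()
    for ids in buckets.values():
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                pairs.add(tuple(sorted((ids[i], ids[j]))))
    return pairs
```

Building the signatures and buckets is a single pass over the documents. Enumerating pairs inside a bucket is only cheap when buckets stay small, so a real pipeline would probably cap the bucket size or keep one representative per bucket.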

1 comments

We used it at work (an anti-cybercrime / phishing-focused company) in conjunction with our crawler to cluster phishing landing pages.

Typically, you'll have loads of script kiddies hosting slightly modified copies of the Apple ID landing page on a compromised web server... Alternatively, you'd have loads of these pages built out of "exploit frameworks" / "kits", so we'd want to categorize and group them to identify the prevalence of a given framework / author.

Had the nice side-effect of prioritizing, speeding up, and automating takedowns for our SOC folks.
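The grouping step can be sketched roughly as below. This is an illustration under stated assumptions, not our actual pipeline: each page is assumed to be already reduced to a set of features (for example shingles of its markup), the 0.7 threshold is arbitrary, and `jaccard` and `cluster_pages` are hypothetical names. It compares every pair exactly, so at crawler scale you would first narrow the pairs down with something like MinHash.

```python
def jaccard(a, b):
    """Exact Jaccard similarity between two feature sets."""
    return len(a & b) / len(a | b) if (a or b) else 1.0


def cluster_pages(pages, threshold=0.7):
    """Group pages whose feature sets are similar, largest group first.

    pages: dict mapping a page id to a set of features (hypothetical input format).
    """
    parent = {page_id: page_id for page_id in pages}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    ids = list(pages)
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if jaccard(pages[a], pages[b]) >= threshold:
                parent[find(a)] = find(b)  # merge the two groups

    groups = {}
    for page_id in ids:
        groups.setdefault(find(page_id), []).append(page_id)
    # The biggest groups are the most prevalent kits, so they come first.
    return sorted(groups.values(), key=len, reverse=True)
```

Sorting by group size gives a crude prevalence ranking, which is the kind of ordering that helps decide which takedowns to handle first.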