by gpderetta
1822 days ago
A lifetime ago I used it on a document clustering system. It is not very good as a general similarity function, but excellent for quickly finding near duplicates in linear time. About half of the documents in the set I was working on (news articles) were duplicates, so an early removal pass sped up the actual clustering pass. My recollection is a bit fuzzy, but I remember specifically using Min-Hash together with random projection.
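For anyone who hasn't seen it, here is roughly what a near-duplicate pass of that kind can look like. This is a minimal sketch and not the system described above: it pairs MinHash signatures with plain LSH banding instead of random projection, and every name and parameter in it (`shingles`, `minhash_signature`, `near_duplicate_candidates`, 64 hashes, 16 bands, 5-word shingles) is an assumption chosen for illustration.

```python
import hashlib
from collections import defaultdict

NUM_HASHES = 64  # signature length (illustrative choice)
BANDS = 16       # LSH bands; 64 / 16 = 4 signature rows per band


def shingles(text, k=5):
    """Return the set of k-word shingles of a document."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(seed, shingle):
    """One 64-bit hash function per seed, built from salted BLAKE2b."""
    digest = hashlib.blake2b(
        shingle.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(text, num_hashes=NUM_HASHES):
    """For each hash function, keep the minimum hash over all shingles."""
    sh = shingles(text)
    return tuple(min(_hash(seed, s) for s in sh) for seed in range(num_hashes))


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching positions estimates the Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def near_duplicate_candidates(docs, bands=BANDS):
    """Bucket documents by signature band; sharing a bucket makes a candidate pair."""
    rows = NUM_HASHES // bands
    buckets = defaultdict(list)
    signatures = {}
    for doc_id, text in docs.items():
        sig = minhash_signature(text)
        signatures[doc_id] = sig
        for b in range(bands):
            buckets[(b, sig[b * rows:(b + 1) * rows])].append(doc_id)
    pairs = set()
    for ids in buckets.values():
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                pairs.add(tuple(sorted((ids[i], ids[j]))))
    return pairs, signatures
```

Each document is hashed once and dropped into a fixed number of buckets, so the pass stays close to linear in the number of documents as long as the buckets stay small. Candidate pairs can then be confirmed with `estimated_jaccard` before one copy is dropped.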
|
Typically, you'll have loads of script kiddies hosting slightly modified copies of the Apple ID landing page on a compromised web server... Alternatively, you'd have loads of these pages built out of "exploit frameworks" / "kits", so we'd want to categorize and group them to identify the prevalence of a given framework / author.
Had the nice side-effect of prioritizing, speeding up, and automating takedowns for our SOC folks.