by aronpye 1631 days ago
A lot of the spam results just seem to be copy-pasted content.

I wonder how difficult it would be to compare the main body text of search results and, if a page is over a 95% match with another site (i.e. it has been copy-pasted), demote it in the search results. If a site racks up too many of these demotions, it gets blacklisted from the index.
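A rough sketch of that idea, assuming "match" means Jaccard similarity over word shingles (one possible measure, not necessarily what a search engine would use). The function names, the shingle size, and the rule that the later-listed page is the one demoted are all made up for illustration:

```python
from collections import Counter


def shingles(text, k=5):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def count_demotions(pages, threshold=0.95):
    """pages: list of (site, body_text) tuples.

    A page is demoted when its body matches an earlier-listed page from a
    different site at or above the threshold. Returns demotions per site.
    """
    sets = [(site, shingles(body)) for site, body in pages]
    demotions = Counter()
    for i, (site_i, s_i) in enumerate(sets):
        for site_j, s_j in sets[:i]:
            if site_i != site_j and jaccard(s_i, s_j) >= threshold:
                demotions[site_i] += 1
                break  # one demotion per page is enough
    return demotions
```

A site whose count passes some limit would then be dropped from the index. The all-pairs loop is quadratic, so it would only work on small result sets.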

2 comments

I have experimented with LSH (locality-sensitive hashing) to identify similar documents in a set of 50k documents.

My LSH implementation is here: https://github.com/loda-lang/loda-rust/blob/develop/script/t...

Example of the 100 most similar documents: https://github.com/neoneye/loda-identify-similar-programs/bl...

LSH can produce false positives, so treat its output as candidates and follow up with a more in-depth comparison.
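To illustrate the two-stage approach, here is a minimal sketch using MinHash with banding, which is one common LSH scheme for Jaccard similarity. It is not the linked implementation; the constants, function names, and the 0.8 threshold are assumptions picked for the example:

```python
import hashlib
from collections import defaultdict
from itertools import combinations

NUM_HASHES = 64            # signature length (assumed value)
BANDS = 16                 # number of LSH bands (assumed value)
ROWS = NUM_HASHES // BANDS


def shingles(text, k=3):
    """Set of k-word shingles; short texts yield a single shingle."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def _hash(seed, shingle):
    """Deterministic 64-bit hash of a shingle, varied by seed."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(shingle_set):
    """For each seed, keep the minimum hash over all shingles."""
    return [min(_hash(seed, s) for s in shingle_set) for seed in range(NUM_HASHES)]


def candidate_pairs(signatures):
    """Stage 1: documents sharing any identical band become candidates."""
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for band in range(BANDS):
            key = (band, tuple(sig[band * ROWS:(band + 1) * ROWS]))
            buckets[key].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def jaccard(a, b):
    return len(a & b) / len(a | b)


def similar_documents(docs, threshold=0.8):
    """Stage 2: verify each candidate with exact Jaccard to drop false positives."""
    sets = {doc_id: shingles(text) for doc_id, text in docs.items()}
    sigs = {doc_id: minhash_signature(s) for doc_id, s in sets.items()}
    result = []
    for a, b in sorted(candidate_pairs(sigs)):
        score = jaccard(sets[a], sets[b])
        if score >= threshold:
            result.append((a, b, score))
    return result
```

The banding step only has to be cheap and have high recall; the exact comparison afterwards is what decides whether a pair really counts as similar.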

How would you avoid throwing the original site out with the bathwater?
Maybe timestamp each page; presumably the earliest copy is the original source. That could also be combined with a site reputation rating or something similar.
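A small sketch of how those two signals might be combined, assuming each page in a duplicate cluster has a first-seen timestamp and a reputation score between 0 and 1. The `Page` fields, `pick_original`, and the margin value are hypothetical:

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Page:
    url: str
    first_seen: datetime   # when the crawler first saw this content (assumed field)
    reputation: float      # 0.0 to 1.0, hypothetical site reputation score


def pick_original(cluster, reputation_margin=0.3):
    """Guess which page in a cluster of near-duplicates is the original.

    Default to the earliest page, but prefer the most reputable page when
    its reputation beats the earliest page's by at least the margin.
    """
    earliest = min(cluster, key=lambda p: p.first_seen)
    most_reputable = max(cluster, key=lambda p: p.reputation)
    if most_reputable.reputation - earliest.reputation >= reputation_margin:
        return most_reputable
    return earliest
```

The override matters because a scraper can sometimes get crawled before the site it copied from, so the earliest timestamp alone can point at the wrong page.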