by aronpye
1631 days ago
A lot of the spam results just seem to be copy-pasted content. I wonder how difficult it would be to compare the main body of text across search results and, if a page is over a 95% match with another site (i.e. it has been copy-pasted), demote it in the search results. If a site generates too many of these demotions, it gets blacklisted from the index.
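For a rough idea of what such a comparison could look like, here is a minimal sketch. It assumes "95% match" means Jaccard similarity over word 3-grams (shingles); the comment does not pin down a metric, so that choice, the `threshold` and `k` parameters, and the function names are all illustrative assumptions rather than anything a search engine is known to use.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles of a text (lowercased, punctuation dropped)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Very short text: treat the whole thing as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def is_copy(text_a, text_b, threshold=0.95, k=3):
    """True if the two texts overlap at or above the (assumed) threshold."""
    return jaccard(shingles(text_a, k), shingles(text_b, k)) >= threshold
```

Comparing every page against every other page this way is quadratic in the number of pages, which is probably the hard part at web scale.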
My LSH implementation is here: https://github.com/loda-lang/loda-rust/blob/develop/script/t...
Example of the 100 most similar documents: https://github.com/neoneye/loda-identify-similar-programs/bl...
There can be false positives, so follow the LSH pass with a more in-depth comparison.
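To illustrate the two-stage idea (cheap LSH candidate generation, then an exact check), here is a small sketch. It is not the linked implementation: it assumes MinHash signatures with banding as the LSH scheme, word 3-gram shingles, and exact Jaccard similarity as the "more in-depth comparison". The constants `NUM_HASHES` and `BANDS` and all function names are hypothetical choices for the example.

```python
import hashlib
import re
from collections import defaultdict
from itertools import combinations

NUM_HASHES = 64            # signature length (assumed value)
BANDS = 16                 # number of LSH bands (assumed value)
ROWS = NUM_HASHES // BANDS  # rows per band


def shingles(text, k=3):
    """Set of k-word shingles; never empty, so min() below is always defined."""
    words = re.findall(r"\w+", text.lower())
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}


def _hash(seed, shingle):
    """Deterministic 64-bit hash of a shingle, salted by the seed."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash(shingle_set):
    """MinHash signature: the minimum hash value under each seeded hash."""
    return [min(_hash(seed, s) for s in shingle_set) for seed in range(NUM_HASHES)]


def candidate_pairs(signatures):
    """Pairs of doc ids whose signatures agree on every row of at least one band."""
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for band in range(BANDS):
            key = (band, tuple(sig[band * ROWS:(band + 1) * ROWS]))
            buckets[key].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def jaccard(a, b):
    """Exact Jaccard similarity of two non-empty sets."""
    return len(a & b) / len(a | b)


def near_duplicates(docs, threshold=0.95):
    """LSH pass to find candidates, then exact Jaccard to drop false positives."""
    sets = {doc_id: shingles(text) for doc_id, text in docs.items()}
    sigs = {doc_id: minhash(s) for doc_id, s in sets.items()}
    result = []
    for a, b in sorted(candidate_pairs(sigs)):
        score = jaccard(sets[a], sets[b])
        if score >= threshold:
            result.append((a, b, score))
    return result
```

Something like `near_duplicates({"a": page_a, "b": page_b, "c": page_c})` returns only the pairs that survive the exact check, so a chance band collision between unrelated pages is filtered out in the second stage.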