So, basically what you're saying is I went wrong with the typos? I got really excited by my algo and was overzealous with adding it. I believe I did take it off the sites I issued re-inclusion requests for, but they never got re-included and I never got any messages back (to my knowledge). Also, they were not on every one of those domains. Each site actually took a long time to make. They involved either generating a data set from scratch or piecing together and parsing other large data sets. For this one in particular, I was crawling the Web for feed discovery and was planning on adding stuff like grouping the best posts by category, etc. Yeah, I would love to know about some others, e.g. japanese2englishdictionary.com, idnscan.com, serverslist.com. Also, did you actually get any complaints about this, or was it triggered by some other threshold/thing? On a side note, I still get requests about exposing some of this data, i.e. sites behind IP addresses or lists of domains matching some criteria. In any case, thx for the info! I can understand the need to take action. I just think it could have been handled better. If typos were the problem, I would have removed them immediately had someone told me, and that could have been automated. In retrospect, it seems pretty obvious, but it wasn't at the time.
There's also a class of folks we call navigation spammers who try to show up for tons of domain name queries. I can give you some history to provide context. In the old days, when you searched for [myspace.com] we'd show a single result as if someone had done the query [info:myspace.com]. The problem is that people would misspell it and do the query [mypsace.com], and then we'd end up showing either no result or (usually) a low-quality typo-squatting url. So we made url queries be a string search, so [myspace.com] would return 10 results. That way if someone misspelled the query, they might get the exact-match bad url at #1, but they'd probably get the right answer somewhere else in the top 10. Overall, the change was a big win, because 10% of our queries are misspelled. But if you're showing 10 results for url queries, now there's an opportunity for spammers to SEO for url queries and get dregs of traffic from the #2 to #10 positions. Now we're getting closer to the present day, so I'll just say we've made algorithmic changes to reduce the impact of that.
But you were hitting a bunch of different factors: tons of typos, specifically for misspelled url queries; autogenerated content; and lots of different domain names that appeared to have a fair amount of overlap (expireddomainscan.com, registereddomainscan.com, refundeddomainscan.com, etc.). If you were doing this again, I'd recommend fewer domain names and putting more UI/value-add work into the individual domains.