Hacker News
by epi0Bauqu 5722 days ago
Great to know. Out of curiosity, in this particular case, did you save supposed violations for each site, or did you blacklist all of them based on a few?

It varies for different cases depending on a lot of factors like severity, impact on users, etc. In the particular case from above, to find out the history of what might have happened, I just picked a domain at random and dug into its history to find the autogenerated pages with tons of typos for each domain.

I kinda thought one example would make the point. Does it help that much more to give another example? I can look up more. For http://www.bigbadblogdirectory.com/ it looks like you were autogenerating typos not just for websites, but for popular blogs. So http://www.bigbadblogdirectory.com/jeffmatthewsisnotmakingth... looks like it had:

(I had to cut out the vast majority of the typos because the comment was too long for HN.)

"jeffmatthewsisnotmakingthisup.blogspoot.com, jeffmatthewsisnotmakingthisup.bloyspot.com, jegfmatthewsisnotmakingthisup.blogspot.com, jeffmatthewsisnomakingthisup.blogspot.com, jeffmatthwesisnotmakingthisup.blogspot.com, jeffmatthewsisnotmakingthisup.nlogspot.com, jeffmatthewsisnotmakingthisup.blogspot.ccom, jeffmatthewsisnotmakingthisup.bligspot.com, jeffmatthewsisnotakingthisup.blogspot.com, jeffmatthewsisnotmakinghtisup.blogspot.com, jeffmatthewsisnotmacingthisup.blogspot.com, jdffmatthewsisnotmakingthisup.blogspot.com, jeffmatthewsisnot akingthisup.blogspot.com, ieffmatthewsisnotmakingthisup.blogspot.com, jeffmatthewsisnotmakingthisup/blogspot.com, jeffmatthewsisnotmajingthisup.blogspot.com, jeffmatthewsisnotmakingthishp.blogspot.com, jeff atthewsisnotmakingthisup.blogspot.com, jeffmatthewsisnotmakingthisup.blogspot/com, jeffmatthewwisnotmakingthisup.blogspot.com."
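
For anyone who wants to check a list like that mechanically: most of the entries above are a single keystroke away from the real hostname (one character added, dropped, swapped for another, or two neighbors transposed). Here's a rough sketch that flags such near-misses using the optimal string alignment distance. The function names are made up for illustration, and this is not a description of how any search engine actually detects typo pages.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance between a and b.

    Counts insertions, deletions, substitutions, and swaps of two
    adjacent characters, each as one edit.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # Adjacent transposition, e.g. "ew" typed as "we".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]


def is_single_typo(candidate: str, original: str) -> bool:
    """True if candidate is exactly one edit away from original (hypothetical helper)."""
    return osa_distance(candidate, original) == 1
```

Calling `is_single_typo("jeffmatthwesisnotmakingthisup.blogspot.com", "jeffmatthewsisnotmakingthisup.blogspot.com")` returns `True`, since the only difference is the transposed "ew".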

I could post more examples from the other domains, but my point is that this is the sort of thing that users dislike and complain about. If you were a blogger and saw pages like this ranking for your name or your site's name, you probably wouldn't be happy either. From looking at a few domains, I don't think that we overgeneralized from a few pages in this case.

I know that you've moved on and the domains are shut down now. And I'm not trying to be cantankerous. I'm just trying to say that from our point of view there are good reasons to take action on sites like this so that users don't complain to us.

So, basically what you're saying is I went wrong with the typos? I got really excited by my algo and was overzealous with adding it. I believe I did take it off of the sites I issued re-inclusion requests for, but they never got re-included and I never got any messages back (to my knowledge). Also, they were not on every one of those domains.

Each site took a long time to make actually. They either involved generating a data set from scratch or piecing together and parsing other large data sets. This one in particular, I was crawling the Web for feed discovery and was planning on adding stuff like grouping the best posts by category, etc.

Yeah, would love to know about some others, e.g. japanese2englishdictionary.com, idnscan.com, serverslist.com. Also, did you actually get any complaints about this or was it triggered by some other threshold/thing? On a side note, I still get requests about exposing some of this data, i.e. sites behind ip addresses or lists of domains matching some criteria. In any case, thx for the info!

I can understand the need to take action. I just think it could have been handled better. If typos were the problem, I would have removed them immediately if someone told me, and that could have been automated. In retrospect, it seems pretty obvious, but it wasn't at the time.

The typos were definitely going overboard. I can understand the appeal of "I've got this great tool--what can I do with it?" But we get a lot of complaints about typo spam, so that's a sensitive issue. I definitely would have done less of that.

There's also a class of folks we call navigation spammers who try to show up for tons of domain name queries. I can give you some history to provide context. In the old days, when you searched for [myspace.com] we'd show a single result as if someone had done the query [info:myspace.com]. The problem is that people would misspell it and do the query [mypsace.com], and then we'd end up either showing no result or (usually) a low-quality typo-squatting url. So we made url queries be a string search, so [myspace.com] would return 10 results. That way if someone misspelled the query, they might get the exact-match bad url at #1, but they'd probably get the right answer somewhere else in the top 10. Overall, the change was a big win, because 10% of our queries are misspelled. But if you're showing 10 results for url queries, now there's an opportunity for spammers to SEO for url queries and get dregs of traffic from the #2 to #10 positions. Now we're getting closer to present-day, so I'll just say we've made algorithmic changes to reduce the impact of that.
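
To make the before/after concrete, here's a toy sketch of the two behaviors described above: an exact lookup that returns at most one result, versus a search that returns up to 10. The index contents, the function names, and especially the similarity-based ranking are assumptions invented for this illustration. The actual ranking isn't described in this thread.

```python
from difflib import SequenceMatcher

# Hypothetical toy index. A real index holds pages, not bare hostnames.
TOY_INDEX = ["myspace.com", "mypsace.com", "facebook.com", "myspace-fans.example"]


def single_result(query: str, index: list[str]) -> list[str]:
    """Old behavior as described: treat a url query like an exact lookup."""
    return [query] if query in index else []


def ten_results(query: str, index: list[str], limit: int = 10) -> list[str]:
    """New behavior as described: return up to `limit` results.

    Ranking by plain string similarity is a stand-in chosen for this
    sketch, not the real ranking.
    """
    ranked = sorted(
        index,
        key=lambda url: SequenceMatcher(None, query, url).ratio(),
        reverse=True,
    )
    return ranked[:limit]
```

With the misspelled query `"mypsace.com"`, `single_result` can only ever hand back the typo-squatting entry, while `ten_results` still puts that exact match first but leaves room for `"myspace.com"` elsewhere in the list. Those leftover slots are the ones navigation spammers go after.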

But you were hitting a bunch of different factors: tons of typos, specifically for misspelled url queries, autogenerated content, lots of different domain names that looked to have a fair amount of overlap (expireddomainscan.com, registereddomainscan.com, refundeddomainscan.com, etc.). If you were doing this again, I'd recommend fewer domain names and putting more UI/value-add work on the individual domains.