by netaddict 5723 days ago
Why did Google blacklist all of your Tldscan sites? Was it just because your sites' content was updated automatically? Or was it because you did something wrong for SEO?

Here's my (completely unsubstantiated) theory. It happened literally the day after crossing $500 & 50K views in adsense. I'm guessing one of those was a trigger for manual review by some contractor, perhaps overseas. They looked at my sites for 3 seconds, found them to be cookie-cutter, and decided to blacklist the account. I get the impression they shoot first, ask questions later. I didn't feel like dealing with it all or starting over, so I just moved on to other things.
I talked to someone from Google at I/O who should know and he claimed they don't play "Whack a mole" with websites. They will tweak their ranking algorithm to punish the behavior they see in a web site they don't want to be ranked.
That was probably me. We have two sides to the webspam team at Google: engineering and manual. We definitely prefer to write algorithms so that we avoid dealing with individual websites--the idea is that you strive to fix the root cause of an issue, not to tackle specific sites. However, if we see a website that violates our guidelines and that gets past the algorithms, we are willing to take manual action. Where possible, we use the output of the manual team not only to reduce spam itself, but to train the next iteration of algorithms.

For example, one of the big issues in blackhat spam this past year was illegally hacked sites. Our algorithms weren't doing the best job on hacked sites, so the manual team kept an eye out for hacked sites to remove them (and often to alert the website owners that they'd been hacked). The data generated by the manual team helped us build and deploy multiple new algorithms to detect hacked sites, leading to a 90% reduction in the number of hacked sites showing up in Google's search results in the past few months. That decrease in hacked spam in turn frees up the manual team to tackle the next bleeding-edge technique the spammers use.

I suspect every major search engine uses similar approaches: try to stop the majority of spam with algorithms, but be willing to take action in the meantime while engineers work to improve the algorithms.

Great to know. Out of curiosity, in this particular case, did you save supposed violations for each site, or did you blacklist all of them based on a few?
It varies for different cases depending on a lot of factors like severity, impact on users, etc. In the particular case from above, to find out the history of what might have happened, I just picked a domain at random and dug into its history to find the autogenerated pages with tons of typos for each domain.

I kinda thought one example would make the point. Does it help that much more to give another example? I can look more up. For http://www.bigbadblogdirectory.com/ it looks like you were autogenerating typos not just for websites, but for popular blogs. So http://www.bigbadblogdirectory.com/jeffmatthewsisnotmakingth... looks like it had

(I had to cut out the vast majority of the typos because the comment was too long for HN.)

jeffmatthewsisnotmakingthisup.blogspoot.com, jeffmatthewsisnotmakingthisup.bloyspot.com, jegfmatthewsisnotmakingthisup.blogspot.com, jeffmatthewsisnomakingthisup.blogspot.com, jeffmatthwesisnotmakingthisup.blogspot.com, jeffmatthewsisnotmakingthisup.nlogspot.com, jeffmatthewsisnotmakingthisup.blogspot.ccom, jeffmatthewsisnotmakingthisup.bligspot.com, jeffmatthewsisnotakingthisup.blogspot.com, jeffmatthewsisnotmakinghtisup.blogspot.com, jeffmatthewsisnotmacingthisup.blogspot.com, jdffmatthewsisnotmakingthisup.blogspot.com, jeffmatthewsisnot akingthisup.blogspot.com, ieffmatthewsisnotmakingthisup.blogspot.com, jeffmatthewsisnotmakingthisup/blogspot.com, jeffmatthewsisnotmajingthisup.blogspot.com, jeffmatthewsisnotmakingthishp.blogspot.com, jeff atthewsisnotmakingthisup.blogspot.com, jeffmatthewsisnotmakingthisup.blogspot/com, jeffmatthewwisnotmakingthisup.blogspot.com."

I could post more examples from the other domains, but my point is that this is the sort of thing that users dislike and complain about. If you were a blogger and saw pages like this ranking for your name or your site's name, you probably wouldn't be happy either. From looking at a few domains, I don't think that we overgeneralized from a few pages in this case.

I know that you've moved on and the domains are shut down now. And I'm not trying to be cantankerous. I'm just trying to say that from our point of view there are good reasons to take action on sites like this so that users don't complain to us.

So, basically what you're saying is I went wrong with the typos? I got really excited by my algo and was overzealous with adding it. I believe I did take it off of the sites I issued re-inclusion requests for, but they never got re-included and I never got any messages back (to my knowledge). Also, they were not on every one of those domains.

Each site took a long time to make actually. They either involved generating a data set from scratch or piecing together and parsing other large data sets. This one in particular, I was crawling the Web for feed discovery and was planning on adding stuff like grouping the best posts by category, etc.

Yeah, would love to know about some others, e.g. japanese2englishdictionary.com, idnscan.com, serverslist.com. Also, did you actually get any complaints about this or was it triggered by some other threshold/thing? On a side note, I still get requests about exposing some of this data, i.e. sites behind ip addresses or lists of domains matching some criteria. In any case, thx for the info!

I can understand the need to take action. I just think it could have been handled better. If typos were the problem, I would have removed them immediately if someone told me, and that could have been automated. In retrospect, it seems pretty obvious, but it wasn't at the time.

Matt Cutts often tells us this, but he speaks specifically from a search webspam perspective. I suspect the Adsense team has different rules and probably can (and does) "play Whack a mole" when appropriate.
Interesting, but in this case it wasn't that it was ranked lower, but that one day I had 25 domains indexed fine and the next day they were nowhere to be found, i.e. not in the index at all. And I had plenty of other domains (not in that adsense account) that were still ranked fine.
Lies, damned lies.
I don't think that was the issue. The fact is that if you've got dozens of websites, each of which has lists of domains/IPs like http://www.mattcutts.com/images/verypopularwebsites-com.png , that is the sort of thing that users complain about and don't want showing up when they do a search. Especially if sites have autogenerated boilerplate content for each one of those links.

I mean, if you're auto-generating a page that has this text: "Elcorillord.com

Common misspellings and typos: Elcroillord.com, Elcorilolrd.com, Elcori.lord.com, Elcofillord.com, www.elcorillord.com, lEcorillord.com, Elcotillord.com, Elcorillor.com, Elcoeillord.com, Elcori,lord.com, Wlcorillord.com, Elcorjllord.com, Elcorillord.coom, Elocrillord.com, Elcor8llord.com, Elvorillord.com, Elcorillprd.com, Elcorillord.cim, Elcorillorf.com, Elcorilloed.com, Elorillord.com, Elckrillord.com, Elcoriplord.com, Elcorillord.ckm, Elcorillord.cm, Elcorillord.ccom, Epcorillord.com, Elcoril;ord.com, Elcoirllord.com, Elcoriillord.com, Elforillord.com, 3lcorillord.com, Elcorollord.com, Elcorillordd.com, Elcorill0rd.com, Elcorillord/com, Elcoriolord.com, Ekcorillord.com, Elcorillord.xom, Elcorillord.co, Elcorilord.com, Elcoillord.com, 4lcorillord.com, Elcoriloord.com, Elcorillorr.com, Eldorillord.com, Elcorillord..com, Elcorrillord.com, http://www.elcorillord.com, El orillord.com, E.corillord.com, Elcorillord. om, Elcorilllrd.com, Elcorillrod.com, Elcoriklord.com, Elcorillorrd.com, Elcorillordcom, Elcorillkrd.com, Elcorillord.om, Elcorlilord.com, Elco4illord.com, Elcorillrd.com, Elcprillord.com, Elcodillord.com, Elcorillordc.om, Ecorillord.com, Elcoorillord.com, Slcorillord.com, Elcorillorx.com, Elcorill9rd.com, Elcorilpord.com, Elcorillord.cpm, Elcorillord.fom, Elco5illord.com, Elc9rillord.com, Elcorillird.com, Elcirillord.com, Elcorillord.clm, Elcorillors.com, Elcorillord.vom, Elcorullord.com, Elcorillord.comm, Elcorillord.c9m, Eocorillord.com, Elcorilloord.com, Elcourillourd.com, E,corillord.com, Elcorkllord.com, Elcorillodr.com, Elcorillodd.com, Elcorillord,com, Elcorillotd.com, Elcorillod.com, Elcorillor.dcom, Elcor9llord.com, Elc0rillord.com, Elcoril,ord.com, Elcorilllord.com, Elcorillo5d.com, EElcorillord.com, Elxorillord.com, E;corillord.com, Elcori;lord.com, Elcorllord.com, Elccorillord.com, Elcrillord.com, Elcoril.ord.com, Elcorilkord.com, Elcorillord.cmo, Ellcorillord.com, Eclorillord.com, Elcorillo4d.com, Rlcorillord.com, wwwelcorillord.com, 
Elclrillord.com, ElcorillordLcom, Dlcorillord.com, Elcorillofd.com, Elcorillore.com, Elcorillord;com, lcorillord.com, Elcorillorc.com, Elcorillord.c0m, Elcorillord.dom, Elcorillord.ocm."

Surely you have to see where many people would consider that either keyword stuffing, gibberish or typo spam.

Thanks Matt. I was building a business off of these domains and realized that Google rankings were the biggest wildcard, and really didn't want any trouble. So I read the Webmaster guidelines closely and often and didn't think I was violating them.

However, I realize some were closer to the line, and I should have focused on being less cookie-cutter and more useful in the domains that were really better (further along). I had always intended to come back and work more on each, but wanted to get placeholders up quickly because it takes a while to get backlinks and get indexed.

I guess I'm saying I had hoped I would have at least been contacted with a warning and what was found objectionable before just being totally blacklisted with no reason given. I would have also hoped that each site would have been addressed individually. If I had been contacted and you had said, hey, you need to remove these misspellings off of these sites, I would have done it immediately.

Here are some comments on the above though. Again, from my perspective these weren't violating the guidelines because the pages were useful from the user's perspective and there were no hidden tricks going on.

First off, there were actually many categories of sites; domains were just one of them. Others were sports stats, definitions, language, medical, and addresses. For each site I made, I was modelling it off of other sites that had gotten great Google rankings for years. I had hoped to eventually improve the UX on those sites and get similar rankings. For domains, I'm talking mainly about who.is and domaintools.com.

Each domain had a static site index, and that's what you linked to above in the screen shot. The extensive ones weren't really meant to be browsed, but just so search engines could find the pages (pre my knowledge of sitemaps). It's no different than any of the other static sitemaps, e.g. http://who.is/whois_index/index.php, and most of them looked better than the screenshot.

That one in particular came from the code for the streetsandzips site that was a big tag cloud. I was trying to find ways to make the static site better, and that was one of them. It looks better when the fonts are of different sizes :). I had intended for that site to make them different sizes based on the traffic numbers, so Google, Facebook, would be really big, etc. On the streetsandzips site the bigger cities are bigger.

In fact, I believe I evolved the sites so that those (site index) pages had noindex,follow on them such that they wouldn't come up on search results. I also added a search engine (Google custom search) on each page as well. I don't remember if I got to the tag cloud sizes for this particular domain at the time of blacklisting.
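
For readers unfamiliar with the directive mentioned above: "noindex,follow" is normally delivered as a robots meta tag in the page's head, asking crawlers to follow the links on a page without listing the page itself in results. A minimal sketch of rendering that tag follows; the helper name `robots_meta` is hypothetical and is not taken from the sites discussed in this thread.

```python
def robots_meta(index: bool, follow: bool) -> str:
    """Render a robots meta tag (hypothetical helper, for illustration only)."""
    directives = [
        "index" if index else "noindex",    # may the page itself appear in results?
        "follow" if follow else "nofollow",  # may crawlers follow its links?
    ]
    content = ",".join(directives)
    return f'<meta name="robots" content="{content}">'


# A site-index page that should be crawled for links but not shown in results:
print(robots_meta(index=False, follow=True))
```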

As for the misspellings, I did mess around with those, but not on all sites, and I believe at the time they were blacklisted they had been removed from most of the domains, if not all.

A list of common misspellings and typos, as you know, is a tool that people provide to those who buy domains. I built it for that purpose and wanted to see how many people were actually searching for this stuff, so I added it to some of the domains. Turns out, a lot of people do. I didn't just tack it on to the footer or cloak it or whatever; I put it in for a purpose that people ask for, e.g. common misspellings and typos.
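
To make concrete what a "common misspellings and typos" tool of this kind does, here is a minimal sketch. It is not the poster's algorithm, which is never shown in this thread; the function name `typo_variants`, the choice of edit types, and the simplified keyboard-neighbor rule (left/right keys on the same QWERTY row only) are all assumptions made for illustration.

```python
# Hypothetical sketch of a single-edit typo generator for a domain name.
QWERTY_ROWS = ("qwertyuiop", "asdfghjkl", "zxcvbnm")


def adjacent_keys(ch: str) -> str:
    """Return the keys immediately left and right of `ch` on its QWERTY row."""
    for row in QWERTY_ROWS:
        pos = row.find(ch)
        if pos != -1:
            return row[max(pos - 1, 0):pos] + row[pos + 1:pos + 2]
    return ""  # digits, hyphens, etc. get no substitutions in this sketch


def typo_variants(domain: str) -> set[str]:
    """Return single-edit misspellings of the part of `domain` before the first dot."""
    name, dot, rest = domain.lower().partition(".")
    variants = set()
    for i, ch in enumerate(name):
        variants.add(name[:i] + name[i + 1:])        # omission
        variants.add(name[:i] + ch + name[i:])       # doubled character
        if i + 1 < len(name):                        # adjacent transposition
            variants.add(name[:i] + name[i + 1] + ch + name[i + 2:])
        for neighbor in adjacent_keys(ch):           # neighboring-key slip
            variants.add(name[:i] + neighbor + name[i + 1:])
    variants.discard(name)  # e.g. swapping "ll" reproduces the original
    variants.discard("")    # a one-letter name minus its letter
    return {v + dot + rest for v in variants}


print(len(typo_variants("elcorillord.com")))
```

Even this restricted version yields dozens of variants per domain, which is the scale of list being debated above.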

Additionally, from the user's perspective, if they got to this page by typing in one of those misspellings, they were getting a big link to the official site at the top and then more information about that site, e.g. siteadvisor rating, traffic, etc. So it was essentially functioning as a one-click "Did you mean x?"

I'm happy to answer more questions about it. But it is pretty clear that it was still shoot first and ask questions later. No one ever contacted me about anything. I wasn't trying to hide anything from Google. It was all in my personal adsense account.

I can understand from a search engine perspective, banning sites. But given I already had a relationship with Google, I expected to be contacted. In fact, at one point I had a call with an Adsense guy from Google trying to help me better optimize my sites for Google! He looked at them and had no issues with them, so I thought I was fine.

Also, IIRC I submitted at least one re-inclusion request after being banned, and never heard a response back from that either. Before submitting that request I did a top to bottom review and tried to remove anything even close to the line, including misspellings I believe.

From what I've seen Google doesn't contact people :) My guess is they also have a policy of not sharing reasons for getting blacklisted, to ensure they're not giving spammers an easy way to fix their website.

They claim they respond to all "Site reconsideration" requests. I had to file one once; they did respond, but with a very non-informative and unhelpful response.

Yeah, in retrospect I should have taken it slower and not gotten as close to the line in the first place. It's totally my fault, and I'm not bitter. As you can tell from the OP, I've had a lot of failure, and I similarly learned from this one.
The tricky part is that the math works out something along the lines of there being ~200,000,000 domains and there being ~20,000 Google employees. At a simplistic level that works out to 10,000 domains per Google employee. Which means that even if Google stopped doing everything else and everyone at Google spent all their time talking to webmasters, they'd each have to answer 10,000 people's questions about rankings, how to make their site, whether they have ranking issues, etc. That's oversimplifying somewhat because there are lots of parked domains, but not too much--you'd be surprised how many people want to talk about their parked domains and why they aren't ranked the way they want. My team is vastly smaller than the number of Google employees, of course. And our first order of business has to be worrying about what users see when they search; talking to webmasters is the secondary priority.
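
The back-of-the-envelope figure above can be checked in a few lines. The domain and employee counts are the rough estimates quoted in the comment; the 10-minutes-per-conversation figure is a hypothetical assumption added here only to show how the load scales, and it does not come from the thread.

```python
# Rough estimates quoted in the comment above.
domains = 200_000_000
google_employees = 20_000

domains_per_employee = domains // google_employees
print(domains_per_employee)  # 10000

# Hypothetical assumption (not from the thread): one 10-minute conversation per domain.
minutes_per_conversation = 10
hours_per_employee = domains_per_employee * minutes_per_conversation / 60
print(round(hours_per_employee))  # 1667
```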

The net effect is that we haven't found a way to talk 1:1 with every webmaster, and I'm not sure whether that's possible. The story of webmaster communication for the last few years at Google has been trying to improve scalability of the info. The earliest Google webmaster communicator ("GoogleGuy") answered questions on a webmaster forum. In 2005 I started a blog, which has the advantage of permalinks for posts like http://www.mattcutts.com/blog/seo-mistakes-autogenerated-doo... . We tried doing live webmaster chats, but that would only reach 400-500 webmasters at a time.

The most scalable thing I've found so far is making videos. Here's a video that came out last month about the dangers of autogenerating pages for example: http://www.youtube.com/watch?v=A8bgpWtVHo4 . We're at almost 300 videos now, and we're getting closer to 3M total views on our webmaster video channel. The hope is that this additional guidance helps people self-identify what can cause issues to avoid or to correct them without needing to talk to Google.

The other big tool that has been helpful is http://google.com/webmasters/ . That provides tools to identify the common errors/mistakes that webmasters make (crawl errors, 404 pages, canonicalization, robots.txt issues, identifying hacked sites using the "Fetch as Googlebot" feature, etc.). That helps with many of the straightforward issues, but of course it doesn't solve the issue with "sheer number of webmasters who have ranking questions vs. number of Googlers." If anyone has suggestions on how to tackle communication with webmasters in a more scalable way, I'd appreciate feedback on how to do better on that.

In my case some clicks worth over $2 (if I remember correctly) triggered a manual review. They sent me a canned email about the ToS, etc.

The funny part is that the domain in question had already expired when the email arrived, because I had decided to stop this venture.

I wonder how many otherwise viable little businesses fall victim to this heavy-handed approach.

Is this one of the reasons DDG got started?

Well, it did push me onto other things, though I wasn't (at least consciously) looking to get back at Google.
Maybe this can help . . .

I have a few websites that automatically make new posts. As of 10/14, they all show 0 pages indexed in Google. Previously they would get a few thousand visitors per day.

I guess Google feels as though they violate their terms and removed them. It seems to me it was a manual removal.

I received no emails in webmaster tools about the removal.

"I have a few websites that automatically make new posts."

Making a bunch of autogenerated sites has its risks. For example, if you were just taking a bunch of MP3 names or Hot Trends queries and then scraping twitter for mentions of those phrases and slapping that all up on a website with scripts, that tends to cruft up our index with autogenerated content that users complain about and that violates our quality guidelines. Likewise, if all you were doing was scraping Twitter for phrases like sad or heartbroken or heartless and throwing that scraped Twitter content up on a webpage with a script, users would also complain about that autogenerated content and it would violate our guidelines. Would that be helpful insight?

As a concrete case to discuss, what about something like http://poeet.com/

I made this over a weekend. And the people whose poetry is being captured love it. But it is auto-generated in the sense you're talking about.

It actually went down for a bit and I got a bunch of complaints, enough that I got it back up relatively quickly.

There's a clear difference between your site http://poeet.com and a clear-cut case of autogenerated spam. Your site is actually quite creative: it aggregates content from a Twitter hashtag and indexes short poems that might otherwise go unnoticed. You are also showing the user's original tweet and @user and not manipulating anything. The original poster was likely scraping content without providing citation, with the aim of wrapping the duplicate content in ads.
Matt - I agree autogenerated content is a problem and is polluting the search results so I'm glad you guys have taken action. But what about sites like bibleknowledgebookstore.com and articlesubmissionreview.com that are buying links, creating fake content, and spamming web 2.0 profile pages and forums? How come tactics like these are not only working, but dominating competitive markets? What's the point in going after high quality editorial links when sites are rewarded for essentially spamming?
Don't forget Google needs content publishers for Adsense. Surely that is the only reason brain liquifying content mills like ehow don't get the slap down? This junk is ridiculous (and this was one of the first pages I looked at):

http://www.ehow.com/blended-families/

How to Plan a Happy Blended Family
How to Harmony in Your New Blended Family
How to have harmony in your new Blended Family
How to Achieve Harmony in a Blended Family
How to Nurture A Blended Family
How to Successfully Manage a Blended Family

WTF is this junk? Why does ehow.com get 3 million Google visitors a day? The mind boggles!

Great question, though ehow is created by user submissions and paid article writers, not by an individual scraping other users' content, publishing it, and not linking back, i.e. $100 plagiarism. I do agree though that eHow is PURE junk and nothing but a site to generate ad revenue. I am not sure if they offer users who submit articles any profit sharing, but those users are being ripped off as well. eHow is by the people behind Enom and a few other networks who give Google a ton of dough for advertising.
Matt, you completely rock. Those are fine examples. :)
xpose2000, happy to help without getting uncomfortably specific. ;)