by johnmu
4119 days ago
Hi, I work with the Google crawling & indexing teams. Let me check with them to see what their recommendation would be. At first glance (based on the pages cached), it seems like we're just following the links within your site, like we would with other websites.

In general, there are a few things one could do in a case like this (I might have more from the team later; these are in no particular order):

- Use rel=nofollow on links you don't need to have followed (this prevents passing of PageRank, which generally means we're less likely to crawl them)
- Use 503 for rate-limiting crawlers. 503 means we'll just retry later.
- Use the crawl rate limit in Webmaster Tools (I see you submitted the report there, so that should be active soon)
- If the content is fully auto-generated, you might choose to use a "noindex,nofollow" robots meta tag on these pages to prevent them from being indexed separately. It's hard for me to judge how useful your content would be in search directly.
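The 503 suggestion above could be sketched roughly as follows. This is only a minimal illustration, not anything the crawling team has recommended in this form: the `CrawlerRateLimiter` class, the "bot" substring check on the User-Agent, the request budget, and the `Retry-After` header are all assumptions made for the example. It is written as plain WSGI middleware using only the Python standard library.

```python
import time


class CrawlerRateLimiter:
    """Hypothetical WSGI middleware: answer 503 once crawlers exceed a budget."""

    def __init__(self, app, max_requests=60, window_seconds=60,
                 clock=time.monotonic):
        self.app = app
        self.max_requests = max_requests      # crawler requests allowed per window
        self.window_seconds = window_seconds  # length of the counting window
        self.clock = clock                    # injectable so the logic is testable
        self.window_start = clock()
        self.count = 0

    def is_crawler(self, environ):
        # Assumption: crawlers are recognised by "bot" in the User-Agent header.
        return "bot" in environ.get("HTTP_USER_AGENT", "").lower()

    def __call__(self, environ, start_response):
        if self.is_crawler(environ):
            now = self.clock()
            # Start a fresh window once the current one has expired.
            if now - self.window_start >= self.window_seconds:
                self.window_start = now
                self.count = 0
            self.count += 1
            if self.count > self.max_requests:
                # Tell the crawler how long remains in the current window.
                remaining = self.window_seconds - (now - self.window_start)
                start_response("503 Service Unavailable", [
                    ("Content-Type", "text/plain"),
                    ("Retry-After", str(max(1, int(remaining)))),
                ])
                return [b"Crawl rate limit reached; please retry later.\n"]
        # Non-crawler traffic and in-budget crawler requests pass through.
        return self.app(environ, start_response)
```

Wrapping an existing WSGI application would look like `application = CrawlerRateLimiter(application, max_requests=30)`. The counter lives in one process's memory, so each running instance would keep its own budget, and the check itself runs inside the application.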
|
A 503 would still require a GAE instance to be running, so it wouldn't necessarily deal with my problem.
I have seen "noindex nofollow" kill a site stone dead in the past so I am very wary indeed of using it. In my experience once you've noindexed a page it is nigh-on impossible to get the engine to index it again.
My content is autogenerated, though I hope it has enough value to be considered useful. It's time-series data of word frequencies in politics, so for example you might use it to see how one candidate is doing relative to another in an election campaign.