by johnmu 4119 days ago
Hi, I work with the Google crawling & indexing teams. Let me check with them to see what their recommendation would be. At first glance (based on the pages cached), it seems like we're just following the links within your site, like we would with other websites. In general, there are a few things one could do in a case like this (I might have more from the team later; these are in no particular order):

- Use rel=nofollow on links you don't need to have followed (this prevents passing of PageRank, which generally means we're less likely to crawl them)

- Use 503 for rate-limiting crawlers. 503 means we'll just retry later.

- Use the crawl rate limit in Webmaster Tools (I see you submitted the report there, so that should be active soon)

- If the content is fully auto-generated, you might choose to use a "noindex,nofollow" robots meta tag on these pages to prevent them from being indexed separately. It's hard for me to judge how useful your content would be in search directly.
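
To make the 503 suggestion above a bit more concrete, here is a minimal sketch of a WSGI app that answers crawler requests with a 503 and a Retry-After header while the site is over budget. The `over_budget` callable, the naive user-agent check, and the 120 second retry value are all assumptions for illustration, not anything recommended in this thread.

```python
from wsgiref.util import setup_testing_defaults

RETRY_AFTER_SECONDS = 120  # assumed value; tune to your own capacity


def is_crawler(user_agent):
    # Naive substring check, for illustration only.
    return "googlebot" in user_agent.lower()


def make_app(over_budget):
    """Build a WSGI app; over_budget is a hypothetical zero-argument callable."""
    def app(environ, start_response):
        agent = environ.get("HTTP_USER_AGENT", "")
        if is_crawler(agent) and over_budget():
            # 503 plus Retry-After asks the crawler to come back later.
            start_response("503 Service Unavailable", [
                ("Content-Type", "text/plain"),
                ("Retry-After", str(RETRY_AFTER_SECONDS)),
            ])
            return [b"Temporarily unavailable, please retry later.\n"]
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [b"normal page\n"]
    return app


def call(app, user_agent):
    """Invoke the app once and return (status, headers dict, body)."""
    environ = {}
    setup_testing_defaults(environ)
    environ["HTTP_USER_AGENT"] = user_agent
    captured = {}

    def start_response(status, headers):
        captured["status"] = status
        captured["headers"] = dict(headers)

    body = b"".join(app(environ, start_response))
    return captured["status"], captured["headers"], body
```

The `call` helper is only there so the behaviour can be checked without starting a server. As noted further down the thread, serving the 503 still costs a running instance.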

2 comments

A big thank you for enquiring on my behalf!

A 503 would still require a GAE instance to be running so wouldn't necessarily deal with my problem.

I have seen "noindex nofollow" kill a site stone dead in the past so I am very wary indeed of using it. In my experience once you've noindexed a page it is nigh-on impossible to get the engine to index it again.

My content is autogenerated, though I hope it has enough value to be considered useful. It's time-series data of word frequencies in politics, so for example you might use it to see how one candidate is doing relative to another in an election campaign.

FWIW I think the main problem is that you're essentially creating an "infinite space," meaning there's an extremely high number of URLs that are findable through crawling your pages, and the more pages we crawl, the more new ones we find. There's no general & trivial solution to crawling and indexing sites like that, so ideally you'd want to find a strategy that allows indexing of great content from your site, without overly taxing your resources on things that are irrelevant. Making those distinctions isn't always easy... but I'd really recommend taking a bit of time to work out which kinds of URLs you want crawled & indexed, and how they could be made discoverable through crawling without crawlers getting stuck elsewhere. It might even be worth blocking those pages from crawling completely (via robots.txt) until you come up with a strategy for that.
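
If you do go the robots.txt route, it may be worth checking the rules before deploying them. The following is a rough sketch using Python's standard `urllib.robotparser`; the Disallow rule and the `/example` page path are hypothetical, and only the `/politics/uk/trends/` prefix comes from the URLs in this thread.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: keep all crawlers out of the trends section for now.
ROBOTS_TXT = """\
User-agent: *
Disallow: /politics/uk/trends/
"""


def build_parser(robots_txt):
    """Parse robots.txt content from a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# A page under the blocked prefix (the "/example" part is made up).
trends_ok = parser.can_fetch(
    "Googlebot", "http://www.languagespy.com/politics/uk/trends/example")

# A page outside the blocked prefix.
section_ok = parser.can_fetch(
    "Googlebot", "http://www.languagespy.com/politics/uk/")

print(trends_ok, section_ok)
```

This only tests how the standard library interprets the rules; individual crawlers may differ in the details.
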
And one more thing ... you have some paths that are generating more URLs on their own without showing different content, for example:

http://www.languagespy.com/politics/uk/trends/70th/70th-anni...
http://www.languagespy.com/politics/uk/trends/70th/70th-anni...
http://www.languagespy.com/politics/uk/trends/70th-anniversa...

I can't check at the moment, but my guess is that all of these generate the same content (and that you could add even more versions of those keywords in the path too). These were found through crawling, so somewhere within your site you're linking to them, and they're returning valid content, so we keep crawling deeper. That's essentially a normal bug worth fixing regardless of how you handle the rest.
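
One common way to handle this kind of duplicate-path bug is to pick a single canonical URL per page and redirect (or 404) everything else. Below is a minimal sketch under loud assumptions: it guesses that the last path segment names the page and that extra segments in between are the redundant variants, and `KNOWN_SLUGS` with its `example-keyword` entry is a hypothetical stand-in for a real lookup. The actual URL scheme of the site may well differ.

```python
PREFIX = "/politics/uk/trends/"    # prefix taken from the URLs above
KNOWN_SLUGS = {"example-keyword"}  # hypothetical; replace with a real lookup


def resolve(path):
    """Return (status, location) for a requested path.

    Assumption: the last non-empty segment identifies the page, and any
    segments between the prefix and that slug are redundant.
    """
    if not path.startswith(PREFIX):
        return 404, None
    segments = [s for s in path[len(PREFIX):].split("/") if s]
    if not segments or segments[-1] not in KNOWN_SLUGS:
        return 404, None
    canonical = PREFIX + segments[-1]
    if path == canonical:
        return 200, None
    # Any other spelling of the same page is redirected to the canonical URL.
    return 301, canonical
```

With this in place, only one URL per page returns content, so a crawler following stray links ends up at the same place instead of finding ever-deeper paths.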

> A 503 would still require a GAE instance to be running so wouldn't necessarily deal with my problem.

It would also need persistence to track how many crawl requests have been served in the last N minutes. Even blindly serving a million 503s an hour could get really expensive.
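
For what it's worth, the "last N minutes" bookkeeping could look roughly like the sliding-window counter below. This is a sketch that keeps state in memory for a single process; with several instances you would presumably need a shared store instead, which is not shown here. The class name and parameters are made up for illustration.

```python
from collections import deque
import time


class SlidingWindowCounter:
    """Allow at most `limit` requests per `window_seconds` (in-memory only)."""

    def __init__(self, window_seconds, limit, clock=time.monotonic):
        self.window_seconds = window_seconds
        self.limit = limit
        self.clock = clock   # injectable so the logic can be tested
        self.hits = deque()  # timestamps of allowed requests, oldest first

    def allow(self):
        now = self.clock()
        # Drop timestamps that have fallen out of the window.
        while self.hits and now - self.hits[0] >= self.window_seconds:
            self.hits.popleft()
        if len(self.hits) >= self.limit:
            return False  # over the limit: caller would answer with a 503
        self.hits.append(now)
        return True
```

Each call to `allow()` costs a little work and memory per request, which is part of the expense described above.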

Having a page that goes nofollow/noindex and back is fine; when we recrawl it, we'll take the new state into account.
Wouldn't "429 Too Many Requests" be more appropriate than 503? Or maybe Google doesn't respect 429?