

by thekeywordgeek 4120 days ago

A big thank you for enquiring on my behalf!

A 503 would still require a GAE instance to be running so wouldn't necessarily deal with my problem.

I have seen "noindex nofollow" kill a site stone dead in the past so I am very wary indeed of using it. In my experience once you've noindexed a page it is nigh-on impossible to get the engine to index it again.

My content is autogenerated, though I hope it has enough value to be considered useful. It's time-series data of word frequencies in politics, so for example you might use it to see how one candidate is doing relative to another in an election campaign.

3 comments

johnmu 4120 days ago

FWIW I think the main problem is that you're essentially creating an "infinite space," meaning there's an extremely high number of URLs that are findable through crawling your pages, and the more pages we crawl, the more new ones we find. There's no general & trivial solution to crawling and indexing sites like that, so ideally you'd want to find a strategy that allows indexing of great content from your site, without overly taxing your resources on things that are irrelevant. Making those distinctions isn't always easy... but I'd really recommend taking a bit of time to work out which kinds of URLs you want crawled & indexed, and how they could be made discoverable through crawling without crawlers getting stuck elsewhere. It might even be worth blocking those pages from crawling completely (via robots.txt) until you come up with a strategy for that.
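To illustrate the robots.txt suggestion, here is a minimal sketch that uses Python's standard `urllib.robotparser` to check what a blanket rule would block. The `Disallow` path and the example URL are assumptions based on the paths mentioned in this thread, not a recommendation for what the site should actually block.

```python
from urllib import robotparser

# Hypothetical robots.txt: the blocked path is an assumption for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /politics/uk/trends/
"""


def build_parser(text):
    """Parse robots.txt content supplied as a string."""
    parser = robotparser.RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# "example" is a made-up final path segment, not a real page on the site.
trends_ok = parser.can_fetch(
    "Googlebot", "http://www.languagespy.com/politics/uk/trends/example"
)
home_ok = parser.can_fetch("Googlebot", "http://www.languagespy.com/")

print("trends crawlable:", trends_ok)
print("home crawlable:", home_ok)
```

Checking the rules this way before deploying them helps confirm that the pages worth indexing stay crawlable while the infinite space is closed off.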


johnmu 4120 days ago

And one more thing ... you have some paths that are generating more URLs on their own without showing different content, for example:

http://www.languagespy.com/politics/uk/trends/70th/70th-anni...
http://www.languagespy.com/politics/uk/trends/70th/70th-anni...
http://www.languagespy.com/politics/uk/trends/70th-anniversa...

I can't check at the moment, but my guess is that all of these generate the same content (and that you could add even more versions of those keywords in the path too). These were found through crawling, so somewhere within your site you're linking to them, and they're returning valid content, so we keep crawling deeper. That's essentially a normal bug worth fixing regardless of how you handle the rest.
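One possible way to fix this kind of bug is to pick a single canonical URL per keyword and redirect every other variant to it. The sketch below assumes, hypothetically, that only the final path segment identifies the keyword and that extra segments before it are redundant; the site's real routing may differ. The names `canonical_path` and `respond` and the sample paths are made up for illustration.

```python
# Prefix taken from the URLs quoted in this thread.
PREFIX = "/politics/uk/trends/"


def canonical_path(path):
    """Return the canonical form of a trends path; other paths pass through."""
    if not path.startswith(PREFIX):
        return path
    segments = [s for s in path[len(PREFIX):].split("/") if s]
    if not segments:
        return path
    # Assumption: only the last segment selects the content.
    return PREFIX + segments[-1]


def respond(path):
    """Return (status, location): 301 to the canonical URL, else 200."""
    canonical = canonical_path(path)
    if canonical != path:
        return 301, canonical
    return 200, None


# Hypothetical paths, not real pages on the site.
print(respond("/politics/uk/trends/foo/foo-bar"))
print(respond("/politics/uk/trends/foo-bar"))
```

With a redirect in place, a crawler that follows a nested variant ends up on the one canonical URL instead of discovering ever-deeper duplicates. The internal links that generate the variants still need fixing at the source.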


falcolas 4120 days ago

> A 503 would still require a GAE instance to be running so wouldn't necessarily deal with my problem.

It would also need persistence to track how many crawl requests have been served in the last N minutes. Even blindly serving a million 503s an hour could get really expensive.
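To make the bookkeeping concrete, here is a minimal sketch of a sliding-window counter that decides when a request should get a 503. It keeps its state in memory, which is an assumption for illustration only: on a platform where instances come and go, the timestamps would have to live in shared storage, and that is exactly the extra cost in question. The class name `CrawlWindow` and its parameters are hypothetical.

```python
from collections import deque


class CrawlWindow:
    """Allow at most `limit` requests per `window_seconds` (in-memory only)."""

    def __init__(self, window_seconds, limit):
        self.window_seconds = window_seconds
        self.limit = limit
        self._hits = deque()  # timestamps of allowed requests, oldest first

    def allow(self, now):
        """Return True to serve the request, False to answer with a 503."""
        cutoff = now - self.window_seconds
        # Drop timestamps that have fallen out of the window.
        while self._hits and self._hits[0] <= cutoff:
            self._hits.popleft()
        if len(self._hits) >= self.limit:
            return False
        self._hits.append(now)
        return True


window = CrawlWindow(window_seconds=60, limit=2)
decisions = [window.allow(t) for t in (0, 1, 2, 60.5)]
print(decisions)
```

Every request still reaches the application to be counted, so even the rejected ones consume instance time.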


johnmu 4120 days ago

Having a page that goes nofollow/noindex and back is fine; when we recrawl it, we'll take the new state into account.
