HN Mirror



	by johnmu 4161 days ago
	FWIW I think the main problem is that you're essentially creating an "infinite space," meaning there's an extremely high number of URLs that are findable through crawling your pages, and the more pages we crawl, the more new ones we find. There's no general & trivial solution to crawling and indexing sites like that, so ideally you'd want to find a strategy that allows indexing of great content from your site, without overly taxing your resources on things that are irrelevant. Making those distinctions isn't always easy... but I'd really recommend taking a bit of time to work out which kinds of URLs you want crawled & indexed, and how they could be made discoverable through crawling without crawlers getting stuck elsewhere. It might even be worth blocking those pages from crawling completely (via robots.txt) until you come up with a strategy for that.
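To make the robots.txt suggestion concrete, here is a minimal sketch that checks which URLs a blanket `Disallow` rule would block, using Python's standard `urllib.robotparser`. The `/trends/` prefix, the `www.example.com` host, and the `is_crawlable` helper are assumptions for illustration only; the right prefix depends on which sections of the site produce the unbounded URL space.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the blocked prefix is an assumption, not the site's real config.
ROBOTS_TXT = """\
User-agent: *
Disallow: /trends/
"""


def is_crawlable(url: str, robots_txt: str = ROBOTS_TXT, agent: str = "*") -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for candidate in (
        "http://www.example.com/about",
        "http://www.example.com/trends/some/generated/path",
    ):
        print(candidate, "->", "crawlable" if is_crawlable(candidate) else "blocked")
```

Running a check like this against a sample of real URLs before deploying the rule helps confirm that the pages you do want indexed stay reachable.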

1 comment

johnmu 4161 days ago

And one more thing ... you have some paths that are generating more URLs on their own without showing different content, for example:

http://www.languagespy.com/politics/uk/trends/70th/70th-anni...
http://www.languagespy.com/politics/uk/trends/70th/70th-anni...
http://www.languagespy.com/politics/uk/trends/70th-anniversa...

I can't check at the moment, but my guess is that all of these generate the same content (and that you could add even more versions of those keywords in the path too). These were found through crawling, so somewhere within your site you're linking to them, and they're returning valid content, so we keep crawling deeper. That's essentially a normal bug worth fixing regardless of how you handle the rest.
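One way to test the guess that several paths return the same content is to hash each response body and group URLs by digest. The sketch below is only an illustration: `find_duplicate_content` is a hypothetical helper, it assumes you have already fetched the pages into a dict of URL to body text, and it only catches byte-identical bodies (pages that differ by a timestamp or similar would need normalizing first).

```python
import hashlib
from collections import defaultdict


def find_duplicate_content(pages: dict[str, str]) -> list[list[str]]:
    """Group URLs whose response bodies are identical.

    `pages` maps URL -> body text. Returns one sorted list of URLs per
    group of duplicates; URLs with unique content are omitted.
    """
    by_digest: dict[str, list[str]] = defaultdict(list)
    for url, body in pages.items():
        digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
        by_digest[digest].append(url)
    return [sorted(urls) for urls in by_digest.values() if len(urls) > 1]


if __name__ == "__main__":
    # Hypothetical sample data: paths and bodies are made up for illustration.
    sample = {
        "/trends/keyword/keyword-page": "<html>same page</html>",
        "/trends/keyword-page": "<html>same page</html>",
        "/trends/other-page": "<html>different page</html>",
    }
    for group in find_duplicate_content(sample):
        print("Same content served at:", group)
```

Any group this reports is a candidate for a redirect to a single canonical URL, or for a 404 on the variants that should not exist.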