| HN Mirror

	by Seirdy 1556 days ago
	I've heard from other people who run engines (Right Dao, Gigablast) that this is a major problem; Common Crawl does look helpful, but it's not continuously updated. FWIW, Right Dao uses Wikipedia as a starting point for crawling. Kiwix makes pre-indexed dumps of Wikipedia, StackExchange, and other sites available. Some sort of partnership between crawlers could go a long way. Have you considered contributing content back towards the Common Crawl?

1 comment

marginalia_nu 1556 days ago

There seems to be a threshold beyond which you get greylisted by Cloudflare. I'm not sure whether it's based on requests per day or on something else they're measuring, but I've been able to mostly circumvent it by crawling at a modest rate.
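
For anyone wondering what "a modest rate" could look like in practice, here is a rough sketch of a per-host throttle. It is only an illustration, not how any particular crawler does it: the `HostThrottle` name and the 5-second default are made up for this example, and the actual threshold Cloudflare applies is unknown.

```python
import time
from urllib.parse import urlparse


class HostThrottle:
    """Enforce a minimum delay between requests to the same host.

    Hypothetical helper for illustration; the default delay is an
    arbitrary assumption, not a known-safe value for any CDN.
    """

    def __init__(self, min_delay_s=5.0, clock=time.monotonic, sleep=time.sleep):
        self.min_delay_s = min_delay_s
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = {}  # host -> earliest time of the next request

    def wait(self, url):
        """Block until the host of `url` may be contacted again.

        Returns the number of seconds spent waiting.
        """
        host = urlparse(url).netloc
        now = self._clock()
        delay = max(0.0, self._next_allowed.get(host, now) - now)
        if delay > 0:
            self._sleep(delay)
        # Schedule the next permitted request relative to when this one starts.
        self._next_allowed[host] = now + delay + self.min_delay_s
        return delay
```

You would call `throttle.wait(url)` right before each fetch. Requests to different hosts are not held up by each other, so the crawl as a whole can stay busy while each individual site only sees a slow trickle.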
