Hacker News
by Seirdy 1556 days ago
I've heard from other people who run engines (Right Dao, Gigablast) that this is a major problem; Common Crawl does look helpful, but it's not continuously updated. FWIW, Right Dao uses Wikipedia as a starting point for crawling. Kiwix makes pre-indexed dumps of Wikipedia, StackExchange, and other sites available.

Some sort of partnership between crawlers could go a long way. Have you considered contributing content back to Common Crawl?

1 comment

There seems to be a threshold beyond which you get greylisted by Cloudflare. I'm not sure whether it's based on requests per day or on something else. But I've been able to mostly avoid it by crawling at a modest rate.
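
To make "crawling at a modest rate" concrete, here is a minimal Python sketch of a per-host delay. It is not the crawler discussed above: the class name `PerHostThrottle`, the `min_interval` parameter, and the 5-second default are all hypothetical, and Cloudflare's actual threshold isn't known here, so the interval is something you'd have to tune by observation.

```python
import time
from urllib.parse import urlsplit


class PerHostThrottle:
    """Enforce a minimum delay between requests to the same host.

    Hypothetical helper for illustration; the default interval is an
    assumption, not a documented Cloudflare limit.
    """

    def __init__(self, min_interval=5.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between hits to one host
        self._clock = clock               # injectable so the logic can be tested
        self._sleep = sleep
        self._next_allowed = {}           # host -> earliest time of next request

    def wait(self, url):
        """Block until a request to url's host is allowed; return seconds waited."""
        host = urlsplit(url).hostname or ""
        now = self._clock()
        ready_at = self._next_allowed.get(host, now)
        delay = max(0.0, ready_at - now)
        if delay > 0:
            self._sleep(delay)
        # Schedule the next slot relative to when this request actually starts.
        self._next_allowed[host] = max(now, ready_at) + self.min_interval
        return delay
```

You'd call `throttle.wait(url)` right before each fetch. Keying the delay on the hostname means a slow pace on one site doesn't hold up requests to unrelated sites, which is usually what you want when the limit is applied per site rather than globally.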