Hacker News
by josefcullhed 1554 days ago
Founder here,

I suggest you start by not implementing a crawler and using commoncrawl.org instead. The problem with running your own web crawler is that you will need a lot of money, and almost all big websites are behind Cloudflare, so you will be blocked pretty quickly. Crawling is a big issue and most of the issues are non-technical.
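To make the Common Crawl suggestion concrete, here is a minimal sketch of looking up captures through its public URL index instead of fetching pages yourself. The crawl ID, the function names, and the sample record in the usage note are assumptions for illustration; check https://index.commoncrawl.org/collinfo.json for the crawl IDs that actually exist.

```python
import json
from urllib.parse import urlencode

# Assumption: an example crawl ID. Current IDs are listed at
# https://index.commoncrawl.org/collinfo.json
CRAWL_ID = "CC-MAIN-2023-50"


def index_query_url(url_pattern: str, crawl_id: str = CRAWL_ID) -> str:
    """Build a query URL for the Common Crawl URL index (hypothetical helper)."""
    params = urlencode({"url": url_pattern, "output": "json"})
    return f"https://index.commoncrawl.org/{crawl_id}-index?{params}"


def parse_index_response(body: str) -> list[dict]:
    """Parse the response body: one JSON object per non-empty line."""
    return [json.loads(line) for line in body.splitlines() if line.strip()]
```

Fetching `index_query_url("example.com")` with any HTTP client should return one JSON record per capture. Each record points at a WARC file plus a byte offset and length, so you can download just that slice rather than crawling the site.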

2 comments

I've heard from other people who run engines (Right Dao, Gigablast) that this is a major problem; Common Crawl does look helpful, but it's not continuously updated. FWIW, Right Dao uses Wikipedia as a starting point for crawling. Kiwix makes pre-indexed dumps of Wikipedia, StackExchange, and other sites available.

Some sort of partnership between crawlers could go a long way. Have you considered contributing content back to Common Crawl?

There seems to be a threshold where you get greylisted by Cloudflare. Not sure if it's requests per day or something else they're measuring. But I've been able to mostly circumvent it by crawling at a modest rate.
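For anyone wondering what "a modest rate" might look like in code, here is a rough sketch of a per-host throttle. The class name and the 5 second default are assumptions; nobody in this thread knows the actual threshold, so treat the number as a placeholder to tune.

```python
import time
from urllib.parse import urlsplit


class HostThrottle:
    """Enforce a minimum delay between requests to the same host (sketch)."""

    def __init__(self, min_delay_s=5.0, clock=time.monotonic, sleep=time.sleep):
        # Assumption: 5 seconds is an arbitrary placeholder, not a known limit.
        self.min_delay_s = min_delay_s
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = {}  # host -> earliest time the next request may start

    def wait(self, url):
        """Block until the URL's host may be hit again; return seconds waited."""
        host = urlsplit(url).hostname or ""
        now = self._clock()
        wait_s = max(0.0, self._next_allowed.get(host, now) - now)
        if wait_s > 0:
            self._sleep(wait_s)
        # Schedule the next permitted request relative to when this one starts.
        self._next_allowed[host] = now + wait_s + self.min_delay_s
        return wait_s
```

Call `throttle.wait(url)` right before each fetch. Requests to different hosts are not delayed against each other, so the crawler as a whole stays busy while each individual site sees a slow trickle.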

First off, nice work!

This seems like a reasonable fallback option, but it's also a weaker one. By "most of the issues are non-technical", do you mean that you need special permission from someone like Cloudflare to get "crawl rights"?