	by josefcullhed 1554 days ago
	Founder here. I suggest you start by using commoncrawl.org instead of implementing your own crawler. The problem with starting a web crawler is that you will need a lot of money, and almost all big websites are behind Cloudflare, so you will be blocked pretty quickly. Crawling is a big issue, and most of the issues are non-technical.
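
For anyone wanting to try the Common Crawl route, here is a minimal sketch of building a lookup against its public CDX index server at index.commoncrawl.org. The helper name `cdx_query_url` is made up for illustration, and the crawl ID `CC-MAIN-2021-43` is only an example; check the list of available crawls and substitute the one you want.

```python
from urllib.parse import urlencode

# Public Common Crawl index server (CDX-style API).
INDEX_HOST = "https://index.commoncrawl.org"


def cdx_query_url(crawl_id, url_pattern, limit=None):
    """Build an index query URL for one crawl (hypothetical helper).

    crawl_id    -- a crawl identifier such as "CC-MAIN-2021-43" (example only)
    url_pattern -- a URL or prefix pattern such as "example.com/*"
    limit       -- optional cap on the number of index records returned
    """
    params = {"url": url_pattern, "output": "json"}
    if limit is not None:
        params["limit"] = limit
    return f"{INDEX_HOST}/{crawl_id}-index?{urlencode(params)}"


if __name__ == "__main__":
    # Prints the query URL; fetching it returns one JSON record per line,
    # each pointing at a byte range inside a WARC file.
    print(cdx_query_url("CC-MAIN-2021-43", "example.com/*", limit=5))
```

Each record returned by the index names a WARC file plus an offset and length, so the page body can be fetched with a ranged request instead of crawling the origin site.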


Seirdy 1554 days ago

I've heard from other people who run engines (Right Dao, Gigablast) that this is a major problem; Common Crawl does look helpful, but it's not continuously updated. FWIW, Right Dao uses Wikipedia as a starting point for crawling. Kiwix makes pre-indexed dumps of Wikipedia, StackExchange, and other sites available.

Some sort of partnership between crawlers could go a long way. Have you considered contributing content back towards the Common Crawl?


marginalia_nu 1554 days ago

There seems to be a threshold where you get greylisted by Cloudflare. Not sure if it's based on requests per day or something else. But I've been able to mostly circumvent it by crawling at a modest rate.
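
As a rough illustration of "crawling at a modest rate", here is a minimal per-host delay tracker. It is a sketch, not how any of the engines mentioned here actually do it: the class name `PoliteScheduler` and the 5-second default are assumptions, and the right delay for staying under any particular threshold is unknown.

```python
import time
from urllib.parse import urlparse


class PoliteScheduler:
    """Tracks the earliest time each host may be fetched again."""

    def __init__(self, min_delay_seconds=5.0, clock=time.monotonic):
        self.min_delay = min_delay_seconds  # assumed value, tune per site
        self.clock = clock                  # injectable for testing
        self.next_allowed = {}              # host -> earliest permitted time

    def wait_time(self, url):
        """Seconds to wait before fetching url (0.0 if it can go now)."""
        host = urlparse(url).netloc
        return max(0.0, self.next_allowed.get(host, 0.0) - self.clock())

    def record_fetch(self, url):
        """Mark a fetch as done, pushing the host's next slot forward."""
        host = urlparse(url).netloc
        self.next_allowed[host] = self.clock() + self.min_delay


if __name__ == "__main__":
    scheduler = PoliteScheduler(min_delay_seconds=1.0)
    for url in ["https://example.com/a", "https://example.com/b"]:
        time.sleep(scheduler.wait_time(url))  # pause only when needed
        # ... fetch the page here ...
        scheduler.record_fetch(url)
```

Keying the delay on the host rather than on the crawler as a whole means a slow rate per site does not have to mean a slow crawl overall, since requests to different hosts can be interleaved.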


pmarreck 1554 days ago

First off, nice work!

This seems like a reasonable fallback option, but it's also a weaker one. By "most of the issues are non-technical", do you mean that you need special permission from someone like Cloudflare to get "crawl rights"?
