| HN Mirror


	by yorhel 5101 days ago
	> I managed to crawl [..] more than 300k movies from IMDB in just a few hours

	I suppose IMDB already has a pretty good architecture to handle that load, but please, if you're crawling a single site, be careful. I host a similar database myself, and the CPU/load graphs of my server tell me exactly when someone has a crawler active again. That's not fun if your goal is to keep a site responsive while keeping hosting costs low.

2 comments

alexbardas 5101 days ago

Very true indeed. I was also randomly changing user-agents (Mozilla, Safari, Chrome, IE). I thought this would make it harder to tell whether there was a lot of traffic from the same network or someone was just intensively crawling the site.

For me, it was more a proof of how efficient and fast a crawler can be. Also, responses from IMDB were very fast, arriving in less than 0.4 seconds, so not much time was lost there.

binarysolo 5101 days ago

Gray hat question out of curiosity and possible experience: did you also use proxies or perhaps even Tor?

joshu 5101 days ago

So how polite does one need to be? One hit per x seconds?

yorhel 5100 days ago

If the /robots.txt does not mention a Crawl-delay, one page per 3 seconds is often a safe value. Of course, this depends rather heavily on the site. In any case, if you have a specific need, always contact the people responsible for the site. I occasionally run custom queries against the database on request, for example.
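
For illustration, the rule of thumb above (use the site's Crawl-delay if it has one, otherwise wait about 3 seconds between pages) could be sketched in Python roughly as follows. This is only a minimal sketch, not anyone's actual crawler: the names `crawl_delay`, `Throttle`, and `DEFAULT_DELAY` are made up for the example, and it relies on the standard library's `urllib.robotparser`, which only recognises whole-number Crawl-delay values.

```python
import time
import urllib.robotparser

# Assumption: the 3-second fallback mentioned above when robots.txt has no Crawl-delay.
DEFAULT_DELAY = 3.0


def crawl_delay(robots_lines, user_agent="*", default=DEFAULT_DELAY):
    """Return the Crawl-delay that applies to user_agent, or `default` if none is set."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_lines)  # accepts the lines of an already-fetched robots.txt
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default


class Throttle:
    """Make consecutive wait() calls return at least `delay` seconds apart."""

    def __init__(self, delay, clock=time.monotonic, sleep=time.sleep):
        self.delay = delay
        self._clock = clock   # injectable so the behaviour can be checked without real sleeping
        self._sleep = sleep
        self._last = None     # time of the previous request, None before the first one

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.delay - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

In a real crawl one would fetch the site's /robots.txt once, pass its lines to `crawl_delay`, and call `Throttle.wait()` before every page request. Even then, the advice above still applies: for anything heavier, ask the site's operators first.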
