Hacker News
by codepoet 6052 days ago
http://news.ycombinator.com/item?id=840244

Crawling "only" 120k pages can be done easily with a pure Python solution over a normal home / office internet connection. The packages urllib, urllib2, robotexclusionrulesparser and lxml are a good start.

Important: Don't forget to implement a crawl rate limit.
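A minimal sketch of what such a rate limit could look like, together with a robots.txt check. This assumes Python 3, so it uses the standard library's urllib.parse and urllib.robotparser rather than urllib2 and robotexclusionrulesparser. The names HostRateLimiter and allowed_by_robots, and the delay values, are made up for illustration.

```python
import time
import urllib.robotparser
from urllib.parse import urlparse


class HostRateLimiter:
    """Enforce a minimum delay between requests to the same host.

    Hypothetical helper; `clock` and `sleep` are injectable so the
    behaviour can be checked without actually waiting.
    """

    def __init__(self, min_delay=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay
        self._clock = clock
        self._sleep = sleep
        self._last_hit = {}  # host -> time of the most recent request

    def wait(self, url):
        """Block until the host of `url` may be hit again; return seconds slept."""
        host = urlparse(url).netloc
        last = self._last_hit.get(host)
        delay = 0.0
        if last is not None:
            delay = max(0.0, self.min_delay - (self._clock() - last))
        if delay > 0:
            self._sleep(delay)
        self._last_hit[host] = self._clock()
        return delay


def allowed_by_robots(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits `user_agent` to fetch `url`."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

In a crawl loop you would call allowed_by_robots() on each URL first, then limiter.wait(url) right before the actual fetch. Keeping the delay per host means a slow limit on one site does not stall requests to others.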

2 comments

80legs automatically handles the crawl rate limits for you.
That's probably not the primary reason to use 80legs, though. The main one is not having to implement a whole crawler yourself.
Thank you for posting that link to the previous HN discussion. They mention Scrapy http://scrapy.org/ and I looked at it. I like that it is Python based, and the tutorial is very good. They even have a shell for testing XPath selectors. Now I have a better understanding of the process. Of course, it is not like filling out a form, as is the case with 80legs, but I am having fun working through the tutorial. I also ran a couple of small jobs with 80legs, but I am unable to see the results. I guess 80legs would be good for huge projects. In any case, I will try to work with both. Thanks again.

Another discussion about Scrapy: http://news.ycombinator.com/item?id=411733