by codepoet
6052 days ago
http://news.ycombinator.com/item?id=840244

Crawling "only" 120k pages can easily be done with a pure Python solution over a normal home or office internet connection. The packages urllib, urllib2, robotexclusionrulesparser and lxml are a good start. Important: don't forget to implement a crawl rate limit, so that you never hit the same host more often than it can reasonably be expected to tolerate.
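To make the rate-limit point concrete, here is a rough sketch of a per-host limiter combined with a robots.txt check. It is only one way to do it, under a few assumptions: it targets Python 3, where urllib2 has been merged into `urllib.request`, and it swaps in the standard library's `urllib.robotparser` for robotexclusionrulesparser so that it runs without extra installs. The names `RateLimiter`, `make_robots`, `fetch`, the `example-crawler` user agent and the 2-second interval are all made up for illustration.

```python
import time
import urllib.request
import urllib.robotparser
from urllib.parse import urlsplit


class RateLimiter:
    """Enforce a minimum delay between consecutive requests to the same host."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between hits on one host
        self._clock = clock               # injectable so it can be tested
        self._sleep = sleep
        self._last = {}                   # host -> time of the last request

    def wait(self, host):
        """Block until at least min_interval has passed since the last hit."""
        now = self._clock()
        last = self._last.get(host)
        if last is not None:
            remaining = self.min_interval - (now - last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last[host] = now


def make_robots(lines):
    """Build a robots.txt parser from already-downloaded robots.txt lines."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(lines)
    return parser


def fetch(url, limiter, robots, user_agent="example-crawler"):
    """Return the page body, or None if robots.txt disallows the URL."""
    if not robots.can_fetch(user_agent, url):
        return None
    limiter.wait(urlsplit(url).netloc)  # rate limit per host, not globally
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read()
```

You would create one `RateLimiter(2.0)` for the whole crawl and one robots parser per host, then call `fetch` for each URL in the queue. The returned bytes can go straight into `lxml.html.fromstring` for link extraction. Keeping the delay per host means a crawl spread over many sites still moves quickly, while no single site gets hammered.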