by adamseabrook
3933 days ago
meanpath.com can do around 200 million pages per day using 13 fairly average dedicated servers. We only crawl the front page (mile wide, inch deep), so the limiting factor is actually DNS. Looking at the network traffic, the bandwidth is split evenly between DNS and HTTP. Google Public DNS will quickly rate limit you, so you need to run your own resolvers (we use Unbound). Unlike Blekko, we are just capturing the source and dumping it into a DB without doing any analysis. As soon as you start trying to parse anything in the crawl data, your hardware requirements go through the roof.

Using parallel with wget or curl is enough to crawl millions of pages per day. I often use puf (http://puf.sourceforge.net/) when I need to do a quick crawl: "puf -nR -Tc 5 -Tl 5 -Td 20 -t 1 -lc 200 -dc 5 -i listofthingstodownload" will easily do 10-20 million pages per day if you are spreading your requests across a lot of hosts.
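
For anyone who wants the "fetch the source and store it, no parsing" idea without parallel or puf, here is a rough Python sketch. It is not how meanpath does it, just an illustration using only the standard library. The names `interleave_by_host`, `fetch`, and `crawl`, the 5 second timeout, and the worker count of 200 are all assumptions picked for the example. It reorders the URL list round-robin by hostname so neighbouring requests mostly hit different hosts, then fetches with a thread pool. It relies on the system resolver, so it does nothing about the DNS bottleneck described above.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from itertools import zip_longest
from urllib.parse import urlsplit
from urllib.request import urlopen


def interleave_by_host(urls):
    """Reorder URLs round-robin across hosts so neighbours rarely share a host."""
    by_host = defaultdict(list)
    for url in urls:
        by_host[urlsplit(url).hostname].append(url)
    ordered = []
    # Take the 1st URL of every host, then the 2nd of every host, and so on.
    for batch in zip_longest(*by_host.values()):
        ordered.extend(u for u in batch if u is not None)
    return ordered


def fetch(url, timeout=5):
    """Return the raw response body, or None on any failure (assumed policy)."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.read()
    except Exception:
        return None


def crawl(urls, fetcher=fetch, workers=200):
    """Fetch every URL with a thread pool; returns {url: body_or_None}."""
    ordered = interleave_by_host(urls)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so zip pairs each URL with its body.
        return dict(zip(ordered, pool.map(fetcher, ordered)))
```

Calling `crawl(list_of_front_page_urls)` returns the raw bytes per URL, which you would then write to whatever store you use. The `fetcher` argument is only there so the fetch step can be swapped out, for example with a stub when trying the scheduling logic offline.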