
by peterwwillis 2984 days ago

Like, extreme performance, or just parallelism? One example of parallelism: xargs -a urls.txt -n 5 -P 20 wget -nv --spider -T 10 -e robots=off. This will run up to 20 wget processes at a time, each given 5 URLs from urls.txt. It's not "efficient", but it's a lot faster than checking URLs one at a time, and you get the whole feature set of Wget.
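
If you'd rather do the same batching from a script, here's a rough Python sketch of the idea, not a drop-in replacement for the command above. It uses threads instead of processes and a plain HEAD request instead of wget's spider mode, so the behavior only approximates it. The function names (batches, head_status, check_all) are made up for illustration; the urls.txt filename, batch size of 5, limit of 20 workers, and 10-second timeout mirror the command.

```python
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def batches(items, size):
    """Split items into consecutive lists of at most `size` (like xargs -n)."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def head_status(url, timeout=10):
    """Send a HEAD request; return the status code, or the error text on failure."""
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except Exception as exc:  # HTTP errors, DNS failures, timeouts, bad URLs
        return str(exc)


def check_batch(urls, check=head_status):
    """Check one batch of URLs sequentially, like a single wget invocation."""
    return [(url, check(url)) for url in urls]


def check_all(urls, batch_size=5, workers=20, check=head_status):
    """Run up to `workers` batches concurrently (like xargs -P)."""
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        work = batches(urls, batch_size)
        for pairs in pool.map(lambda batch: check_batch(batch, check), work):
            results.update(pairs)
    return results


if __name__ == "__main__":
    # Assumes urls.txt holds one URL per line, as in the xargs example.
    with open("urls.txt") as handle:
        url_list = [line.strip() for line in handle if line.strip()]
    for url, status in check_all(url_list).items():
        print(status, url)
```

The check parameter is only there so you can swap in your own fetch logic (or a stub for testing) without touching the batching code.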

For more customizable spidering, Scrapy lets you write your own spiders, and even deploy spider daemons to run in production (https://doc.scrapy.org/en/latest/topics/deploy.html). For an out-of-the-box version, try Spidy (https://github.com/rivermont/spidy). For super serious spidering, try Heritrix (https://webarchive.jira.com/wiki/spaces/Heritrix/overview) or Nutch (https://nutch.apache.org/).

Here's an interesting read on crawling a quarter billion pages in 40 hours: http://www.michaelnielsen.org/ddi/how-to-crawl-a-quarter-bil... From my own experience crawling massive dynamic state-driven websites, even if you're trying to just grab a single page, you will eventually want the extra features.