| HN Mirror


	by peterwwillis 2984 days ago
	So, I'll bite. Why use this and not wget, curl, or any other HTTP spider?

2 comments

pknopf 2984 days ago

Is there a good HTML parser that you can pipe into/out of? Using Go, it would be easier to parse the HTML and tell the difference between <p>http://somewhere.com/</p> and <a href="http://somewhere.com/">test</a>
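
To make the distinction concrete, here is a minimal sketch in Python rather than Go (chosen only so that it runs on the standard library alone). The names URLCollector and collect_urls and the URL regex are hypothetical, made up for illustration; none of this comes from the tool being discussed.

```python
import re
from html.parser import HTMLParser

# Deliberately loose pattern for absolute http(s) URLs in text (an assumption,
# not a complete URL grammar).
URL_RE = re.compile(r"https?://[^\s<>\"']+")


class URLCollector(HTMLParser):
    """Collects URLs and records whether each came from an href or from text."""

    def __init__(self):
        super().__init__()
        self.href_urls = []  # values of <a href="...">
        self.text_urls = []  # URLs that merely appear in text content

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.href_urls.append(value)

    def handle_data(self, data):
        self.text_urls.extend(URL_RE.findall(data))


def collect_urls(html_text):
    """Return (href_urls, text_urls) for a string of HTML."""
    parser = URLCollector()
    parser.feed(html_text)
    parser.close()
    return parser.href_urls, parser.text_urls
```

Feeding it sys.stdin.read() and printing one URL per line would make it usable in a shell pipeline.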

laumars 2984 days ago

Personally I don't care about the difference between your two examples because a published URL is still a published URL regardless of whether it is a hyperlink or not.

However, where I do care about the difference between an anchor and a paragraph block is with relative links or other URLs without the protocol prefix, because it is harder to programmatically guess which of those is a web URL and which is an example file system path in a technical document (for example). In an ideal world the JavaScript, CSS and any web APIs (e.g. JSON returns) would be executed locally to check any modern way of abstracting away URLs (page redirection et al). But that's not to say there isn't a place for a less sophisticated parser (though I would say that, as I've also written a link checker similar to the one posted, hehehe).
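
A rough sketch of why the anchor context matters, in Python with only the standard library. The helper name resolve_link and the example.com URLs are hypothetical. The point is that a bare string such as /etc/hosts resolves to a perfectly valid web URL, so nothing in the string itself says whether it was meant as a link or as a file system path; only the fact that it came from an href does.

```python
from urllib.parse import urljoin, urlparse


def resolve_link(base_url, href):
    """Resolve href against the page URL; return None for non-web schemes."""
    absolute = urljoin(base_url, href)
    if urlparse(absolute).scheme in ("http", "https"):
        return absolute
    return None


page = "http://example.com/docs/page.html"  # hypothetical page URL

relative = resolve_link(page, "../img/logo.png")
protocol_less = resolve_link(page, "//cdn.example.com/x.js")
path_like = resolve_link(page, "/etc/hosts")
mail = resolve_link(page, "mailto:someone@example.com")
```

Here path_like comes back as a web URL even though the same string in a paragraph of a technical document would more likely be a file system path.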

ryanlol 2984 days ago

Because sometimes you care about performance?

Although, based on a quick look at the code, this thing isn't going to go particularly fast.

peterwwillis 2984 days ago

Like, extreme performance, or just parallelism? One example of parallelism: xargs -a urls.txt -n 5 -P 20 wget -nv --spider -T 10 -e robots=off. This will run up to 20 processes with 5 URLs each. It's not "efficient" but it's faster than nothing, and you get the whole feature set of Wget.
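
For comparison, the same "just parallelism" idea can be sketched in a single Python process with a thread pool. This is a different technique from the xargs/wget one-liner (threads instead of processes, and a plain HEAD request instead of Wget's spider mode), and the names head_status and check_all are hypothetical. The worker count of 20 and the 10 second timeout simply mirror the -P 20 and -T 10 flags above.

```python
import urllib.error
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def head_status(url, timeout=10):
    """Return the HTTP status of a HEAD request, or the error text on failure."""
    try:
        req = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx/5xx responses still carry a status code
    except (urllib.error.URLError, OSError, ValueError) as err:
        return str(err)  # DNS failure, timeout, malformed URL, ...


def check_all(urls, check=head_status, workers=20):
    """Run check on every URL using up to `workers` threads; return {url: result}."""
    urls = list(urls)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(urls, pool.map(check, urls)))
```

Something like check_all(line.strip() for line in open("urls.txt")) would cover the urls.txt case, though you lose the rest of Wget's feature set.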

For more customizable spidering, Scrapy lets you write your own spider and even deploy spider daemons to run in production (https://doc.scrapy.org/en/latest/topics/deploy.html). For an out-of-the-box version, try Spidy (https://github.com/rivermont/spidy). For super serious spidering, try Heritrix (https://webarchive.jira.com/wiki/spaces/Heritrix/overview) or Nutch (https://nutch.apache.org/).

Here's an interesting read on crawling a quarter billion pages in 40 hours: http://www.michaelnielsen.org/ddi/how-to-crawl-a-quarter-bil... From my own experience crawling massive dynamic state-driven websites, even if you're trying to just grab a single page, you will eventually want the extra features.
