by yourabi 6486 days ago
I would also take a look at Heritrix (http://crawler.archive.org/) -- it's what powers the Wayback Machine.
1 comment

Thanks for the plug!

As a developer of Heritrix, I can't honestly say it's compact or written in Python, but it is well-behaved, highly customizable (through both settings and many Java extension points), and capable of high-volume crawling for many purposes.

You could also embed Python code via Jython with a little work, if necessary.