by mig 6730 days ago
Don't write your own crawler. Use Nutch.

It is designed to scale and to do MapReduce-style parallel processing. I would strongly recommend taking a look at it before writing your own.

http://lucene.apache.org/nutch/

1 comment

MapReduce? Just how many requests will you be making to third-party sites at once? Sounds like a good way to get blocked fast.
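
For what it's worth, the usual way around the blocking concern is to throttle per host, so that parallelism is spread across many sites instead of piling onto one. Here is a rough sketch in Python, not anything taken from Nutch: the `HostThrottle` class, its `min_delay` parameter, and the example URLs are all made up for illustration, and a real crawler would also need to honor robots.txt.

```python
import time
from urllib.parse import urlparse


class HostThrottle:
    """Hypothetical helper: enforce a minimum delay between requests to one host."""

    def __init__(self, min_delay=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay   # seconds between hits on the same host (assumed value)
        self._clock = clock          # injectable so the logic can be tested without waiting
        self._sleep = sleep
        self._next_allowed = {}      # host -> earliest time the next request may start

    def wait(self, url):
        """Block until the host of `url` may be contacted again; return seconds waited."""
        host = urlparse(url).netloc
        now = self._clock()
        ready_at = self._next_allowed.get(host, now)
        delay = max(0.0, ready_at - now)
        if delay > 0:
            self._sleep(delay)
        # Schedule the next permitted request for this host.
        self._next_allowed[host] = max(now, ready_at) + self.min_delay
        return delay


if __name__ == "__main__":
    throttle = HostThrottle(min_delay=2.0)
    for url in ["http://example.com/a", "http://example.org/", "http://example.com/b"]:
        waited = throttle.wait(url)
        print(f"{url}: waited {waited:.1f}s before fetching")
```

Call `wait(url)` right before each fetch. Requests to different hosts go through immediately, while back-to-back requests to the same host are spaced out by `min_delay`. This version is single-threaded; a parallel crawler would need a lock around the shared dictionary.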