by dm_i386 3350 days ago
What tools did you use? What had to be custom-written and why?
1 comment

We tried a bunch of crawlers (Nutch, Heritrix, StormCrawler, ...) and eventually settled on Mixnode. Since it's a 'cloud platform', we didn't really have to change anything.

As for processing the data we crawled, we're using ArchiveSpark (https://github.com/helgeho/ArchiveSpark).

Also, Mixnode defaults to Amazon S3 for storage, which was OK with us since we're using EC2 to process the results.