Hacker News
by dor_jack_2 3354 days ago
We tried a bunch of crawlers (Nutch, Heritrix, StormCrawler, ...) and eventually settled on Mixnode. Since it's a 'cloud platform', we didn't really have to change anything.

As for processing the data we crawled, we're using ArchiveSpark (https://github.com/helgeho/ArchiveSpark).

Also, Mixnode defaults to Amazon S3 for storage, which was OK with us since we're using EC2 to process the results.