by dor_jack_2
3354 days ago
We tried a bunch of technologies, including Nutch, Heritrix, and Storm Crawler, among others, and eventually settled on Mixnode. Since it's a 'cloud platform', we didn't really have to change anything. To process the data we crawled, we're using ArchiveSpark (https://github.com/helgeho/ArchiveSpark). Also, Mixnode defaults to Amazon S3 for storage, which was OK with us since we're using EC2 to process the results.