We tried a bunch of technologies like Nutch, Heritrix, Storm Crawler, ... and eventually settled on Mixnode. Since it's a 'cloud platform', we didn't really have to change anything.
For our purposes, Common Crawl's corpus was missing too many websites (possibly due to those websites' robots.txt configs). We also needed deep coverage, which CC could not provide.