Hacker News
by dor_jack_2 3353 days ago
For our purposes, Common Crawl's corpus was missing too many websites (possibly because of those sites' robots.txt configs). We also needed deeper coverage than CC could provide.