
> There isn't an easy way today to explore or search what is contained in the Common Crawl index.

By that you mean searching the full text contents of their crawl, right?

The index is super easy to search nowadays -- in pretty much any language you can slap a few lines of code around a GET request (using range requests [0] if needed) and explore a columnar representation of the index [1].
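To give a rough idea of what "a few lines of code around a GET request" could look like, here is a minimal Python sketch using only the standard library. The function names (`byte_range_header`, `fetch_range`) are made up for illustration, and the URL, offset, and length are whatever you pass in; nothing here is specific to Common Crawl's actual file layout.

```python
import urllib.request


def byte_range_header(offset: int, length: int) -> dict:
    """Build an HTTP Range header for `length` bytes starting at `offset`."""
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    # Byte ranges are inclusive on both ends, hence the -1.
    return {"Range": f"bytes={offset}-{offset + length - 1}"}


def fetch_range(url: str, offset: int, length: int, timeout: float = 30.0) -> bytes:
    """Fetch only the requested slice of a remote file (hypothetical helper)."""
    req = urllib.request.Request(url, headers=byte_range_header(offset, length))
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        # 206 Partial Content means the server honored the range;
        # a plain 200 means it ignored the header and is sending the whole file.
        if resp.status != 206:
            raise RuntimeError(f"expected 206 Partial Content, got {resp.status}")
        return resp.read()
```

You would call it as `fetch_range(url, offset, length)`, with the offset and length taken from whichever index row you looked up. Checking for 206 matters because a server that ignores the `Range` header will happily stream the entire file back instead.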

[0] https://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec1...

[1] https://commoncrawl.org/2018/03/index-to-warc-files-and-urls...