Hacker News
by chris_f 1868 days ago
Nice! Maybe at some point you could release a general web search engine for the Common Crawl corpus? It seems even simpler than this proof of concept, but potentially more useful for people looking for true full-text web search.

There isn't an easy way today to explore or search what is contained in the Common Crawl index.

2 comments

> There isn't an easy way today to explore or search what is contained in the Common Crawl index.

By that you mean searching the full text contents of their crawl, right?

The index is super easy to search nowadays -- in pretty much any language you can slap a few lines of code around a GET request (using range requests [0] if needed), and explore a columnar representation of the index [1].

[0] https://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec1...

[1] https://commoncrawl.org/2018/03/index-to-warc-files-and-urls...
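
For anyone who wants to see what the "few lines of code around a GET request" might look like, here is a rough Python sketch of the range-request part. It assumes you already have a file name, byte offset, and record length from an index row, that the data is served over HTTPS from `https://data.commoncrawl.org/`, and that each record is stored as its own gzip member. The names `BASE_URL`, `byte_range`, and `fetch_record` are made up for illustration; they are not part of any Common Crawl tooling.

```python
import gzip
import urllib.request

# Assumption: public HTTPS endpoint for Common Crawl data; check the project's docs.
BASE_URL = "https://data.commoncrawl.org/"


def byte_range(offset: int, length: int) -> str:
    """Build an HTTP Range header value covering `length` bytes from `offset`."""
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    # The end position of a byte range is inclusive, hence the -1.
    return f"bytes={offset}-{offset + length - 1}"


def fetch_record(filename: str, offset: int, length: int) -> bytes:
    """Fetch a single record with a range request and decompress it.

    Assumes the requested byte range is exactly one gzip member.
    """
    request = urllib.request.Request(
        BASE_URL + filename,
        headers={"Range": byte_range(offset, length)},
    )
    with urllib.request.urlopen(request) as response:
        return gzip.decompress(response.read())
```

You would call `fetch_record(filename, offset, length)` with the three values taken from whatever index row you looked up. Only the requested bytes are downloaded, so you never have to pull a whole archive file just to read one page.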

That's on my to-do list for next week. :)