by chris_f
1868 days ago
Nice! Maybe at some point you could release a general web search engine for the Common Crawl corpus? It seems even simpler than this proof of concept, but potentially more useful for people looking for true full-text web search. There isn't an easy way today to explore or search what is contained in the Common Crawl index.
|
By that you mean searching the full text contents of their crawl, right?
The index is super easy to search nowadays: in pretty much any language you can slap a few lines of code around a GET request (using range requests [0] if needed), and explore a columnar representation of the index [1].
[0] https://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec1...
[1] https://commoncrawl.org/2018/03/index-to-warc-files-and-urls...
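To make the "few lines of code around a GET request" concrete, here is a rough Python sketch. It is not the columnar route from [1]; it swaps in the simpler JSON output of the URL index server, then uses a range request to pull a single record out of a WARC file. The host names, the crawl ID, and the helper names (`range_header`, `parse_index_line`, `lookup`, `fetch_record`) are my assumptions for illustration, so check them against the current Common Crawl docs before relying on them.

```python
import gzip
import json
import urllib.parse
import urllib.request

# Assumptions, not taken from the thread: both host names and the crawl ID.
INDEX_HOST = "https://index.commoncrawl.org"
DATA_HOST = "https://data.commoncrawl.org"
CRAWL_ID = "CC-MAIN-2018-13"


def range_header(offset: int, length: int) -> dict:
    """Build an HTTP Range header for `length` bytes starting at `offset`.

    The end position in a byte range is inclusive, hence the -1.
    """
    return {"Range": f"bytes={offset}-{offset + length - 1}"}


def parse_index_line(line: str) -> tuple:
    """Pull (filename, offset, length) out of one JSON line of index output."""
    rec = json.loads(line)
    return rec["filename"], int(rec["offset"]), int(rec["length"])


def lookup(url: str, crawl: str = CRAWL_ID) -> list:
    """Ask the index server which WARC files hold captures of `url`."""
    query = urllib.parse.urlencode({"url": url, "output": "json"})
    with urllib.request.urlopen(f"{INDEX_HOST}/{crawl}-index?{query}") as resp:
        lines = resp.read().decode("utf-8").splitlines()
    return [parse_index_line(line) for line in lines if line.strip()]


def fetch_record(filename: str, offset: int, length: int) -> bytes:
    """Fetch one WARC record with a range request and decompress it.

    Assumes each record is stored as its own gzip member, so the byte
    slice can be decompressed without downloading the rest of the file.
    """
    req = urllib.request.Request(
        f"{DATA_HOST}/{filename}", headers=range_header(offset, length)
    )
    with urllib.request.urlopen(req) as resp:
        return gzip.decompress(resp.read())
```

Usage would be something like `fetch_record(*lookup("example.com")[0])`, which needs network access and at least one capture of that URL in the chosen crawl. Note this only tells you where a known URL lives; it does nothing for full-text search over page contents.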