| What we make available is: -- (A) the dataset after pre-processing the raw CommonCrawl data (e.g., text extraction and language identification) and some minimal filtering; and (B) for each document in (A), 40+ pre-computed "features" (which we call "quality annotations") that you can use to further filter or deduplicate it. For example, one such feature is "how similar this document is to Wikipedia". -- (A) is around 30T tokens, but you might want to use the features in (B) to filter/dedup it further, e.g., down to 5T. For example, if documents similar to Wikipedia are the most helpful for your application, you can take the documents with the highest scores for the feature "how similar this document is to Wikipedia". Of course, the really interesting case is when you consider a larger subset of these features (or maybe even automatically learn the best way to filter). Our goal is to make this as flexible as possible so that you can fit it into your own application. What we have released is both (A) and (B). If you have any questions, please let us know! Thanks for your interest, and have fun with the data! |
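
As a rough illustration of the "take the top documents by one feature" idea in the reply above, here is a minimal Python sketch. The field name `wiki_similarity`, the helper `top_fraction_by_score`, and the toy records are all hypothetical. They are not the dataset's actual annotation names or loading API, so adapt them to whatever the released annotations are called.

```python
from typing import Any, Dict, List


def top_fraction_by_score(
    docs: List[Dict[str, Any]], score_key: str, fraction: float
) -> List[Dict[str, Any]]:
    """Keep the highest-scoring `fraction` of documents for one annotation."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    # Sort from highest to lowest score for the chosen annotation.
    ranked = sorted(docs, key=lambda d: d[score_key], reverse=True)
    # Keep at least one document when the input is non-empty.
    keep = max(1, int(len(ranked) * fraction))
    return ranked[:keep]


# Toy records; "wiki_similarity" is an assumed (hypothetical) annotation name.
docs = [
    {"id": "a", "wiki_similarity": 0.91},
    {"id": "b", "wiki_similarity": 0.12},
    {"id": "c", "wiki_similarity": 0.55},
    {"id": "d", "wiki_similarity": 0.78},
    {"id": "e", "wiki_similarity": 0.30},
    {"id": "f", "wiki_similarity": 0.05},
]

kept = top_fraction_by_score(docs, "wiki_similarity", 0.5)
print([d["id"] for d in kept])
```

At real scale you would likely stream the annotations and apply a score threshold rather than sort everything in memory; the sketch only shows the selection logic.
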
> how similar this document is to Wikipedia
So that’s a measure of how similar it is to the background vector of all (language in focus) Wikipedia data?