by zhangce 969 days ago
It is around 100TB (84 CommonCrawl dumps, roughly 1TB per dump)
1 comment

Yes, one small clarification: the 1TB per dump refers to the head+middle partition of the dataset, and it includes both the text documents and the quality signals. On top of that, there is another ~700GB for the minhash signatures and 1-1.5TB for the documents in the tail split.
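For anyone sizing storage, here is a rough back-of-the-envelope sketch of what these numbers add up to. It assumes the ~700GB and 1-1.5TB figures are per dump, like the 1TB figure; the comment above does not say so explicitly, so treat that as an assumption. The function and constant names are made up for illustration.

```python
# Back-of-the-envelope storage estimate from the figures quoted in this thread.
# ASSUMPTION: the minhash and tail figures are per dump, like the 1TB
# head+middle figure. The thread does not state this explicitly.

NUM_DUMPS = 84               # CommonCrawl dumps, from the parent comment
HEAD_MIDDLE_TB = 1.0         # documents + quality signals, per dump
MINHASH_TB = 0.7             # minhash signatures (assumed per dump)
TAIL_TB_RANGE = (1.0, 1.5)   # tail-split documents (assumed per dump)


def estimate_total_tb(num_dumps=NUM_DUMPS):
    """Return (head_middle_only, low_total, high_total) in TB."""
    head_middle_only = num_dumps * HEAD_MIDDLE_TB
    low_total = num_dumps * (HEAD_MIDDLE_TB + MINHASH_TB + TAIL_TB_RANGE[0])
    high_total = num_dumps * (HEAD_MIDDLE_TB + MINHASH_TB + TAIL_TB_RANGE[1])
    return head_middle_only, low_total, high_total


if __name__ == "__main__":
    head_middle_only, low_total, high_total = estimate_total_tb()
    print(f"head+middle only: ~{head_middle_only:.0f} TB")
    print(f"everything:       ~{low_total:.0f} to {high_total:.0f} TB")
```

Under that per-dump reading, head+middle alone comes to about 84TB, which matches the "around 100TB" ballpark, and the full set with minhash signatures and the tail split would land somewhere around 227 to 269TB. If the extra figures are totals rather than per dump, the overall number would be much closer to the head+middle figure.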