We ran exact dedup across all 84 dumps: there are 100T tokens before exact dedup and 30T tokens after, so about 70% of the tokens were removed. Fuzzy dedup (we have simhash signatures pre-computed for different similarity levels) could potentially reduce this further.
There is quite a lot of redundancy across dumps, but also a lot of unique/distinct documents.
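To make the difference between the two dedup stages concrete, here is a minimal sketch in Python. It is not the pipeline that was actually run: the function names, the whitespace tokenization, the 64-bit signature size, and the `max_distance` threshold are all assumptions chosen for illustration. Exact dedup drops documents whose content hash has already been seen; fuzzy dedup drops documents whose simhash signature is within a small Hamming distance of one already kept.

```python
import hashlib
from typing import Iterable, Iterator, List, Tuple


def exact_dedup(docs: Iterable[str]) -> Iterator[str]:
    """Yield each distinct document once, keeping the first occurrence."""
    seen = set()
    for text in docs:
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            yield text


def simhash(text: str, bits: int = 64) -> int:
    """Toy simhash over whitespace tokens (the feature choice is an assumption)."""
    counts = [0] * bits
    for token in text.split():
        # Take 8 bytes of an MD5 digest as a 64-bit token hash.
        h = int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:8], "big")
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    # Bit i of the signature is set when more tokens voted 1 than 0.
    return sum(1 << i for i in range(bits) if counts[i] > 0)


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two signatures differ."""
    return bin(a ^ b).count("1")


def fuzzy_dedup(docs: Iterable[str], max_distance: int = 3) -> List[str]:
    """Keep a document only if no kept signature is within max_distance bits."""
    kept: List[Tuple[int, str]] = []
    for text in docs:
        sig = simhash(text)
        if all(hamming(sig, other) > max_distance for other, _ in kept):
            kept.append((sig, text))
    return [text for _, text in kept]
```

The pairwise comparison in `fuzzy_dedup` is quadratic and only suitable for small samples; at the scale of many dumps one would typically index the signatures (for example by bit bands) rather than compare every pair. A looser `max_distance` corresponds to a lower similarity level and removes more documents.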