by zhangce
969 days ago
We did an exact dedup across all 84 dumps: there are 100T tokens before this exact dedup and 30T tokens after, so about 70% of the raw tokens are exact duplicates. If we do further fuzzy dedup (we have simhash signatures pre-computed for different similarity levels), this can potentially be reduced further. There is quite a lot of redundancy across dumps, but also a lot of unique/distinct documents.
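For readers unfamiliar with the two stages, here is a minimal sketch of how exact dedup and simhash-based fuzzy dedup differ. This is not the pipeline described above: the function names, the whitespace tokenizer, the 64-bit signature size, and the `max_distance` threshold are all assumptions chosen for illustration, and the pairwise comparison would not scale to web-size data.

```python
import hashlib


def exact_dedup(docs):
    """Keep the first occurrence of each byte-identical document."""
    seen = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha1(doc.encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


def simhash(text, bits=64):
    """Toy simhash: each token votes +1/-1 on every bit of the signature."""
    counts = [0] * bits
    for token in text.lower().split():  # assumption: whitespace tokens
        raw = hashlib.md5(token.encode("utf-8")).digest()[:8]
        h = int.from_bytes(raw, "big")
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if counts[i] > 0)


def hamming(a, b):
    """Number of differing bits between two signatures."""
    return bin(a ^ b).count("1")


def fuzzy_dedup(docs, max_distance=3):
    """Drop a document if its signature is within max_distance bits
    of one already kept. Quadratic, for illustration only."""
    kept = []  # list of (signature, document)
    for doc in docs:
        sig = simhash(doc)
        if all(hamming(sig, s) > max_distance for s, _ in kept):
            kept.append((sig, doc))
    return [doc for _, doc in kept]
```

Exact dedup only catches identical text, while the simhash pass also collapses documents whose signatures differ in a few bits. Tightening or loosening `max_distance` corresponds loosely to choosing a different similarity level.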