by zhangce 969 days ago
We did an exact dedup across all 84 dumps: there are 100T tokens before this exact dedup and 30T tokens after. If we do further fuzzy dedup (we have simhash signatures pre-computed for different similarity levels), this can potentially be reduced further.

There is quite a lot of redundancy across dumps, but also a lot of unique/distinct documents.
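
For anyone unfamiliar with the two kinds of dedup mentioned above, here is a minimal Python sketch. It is not our actual pipeline: the function names, the whitespace tokenization, the MD5/SHA-1 hash choices, and the 64-bit fingerprint width are all assumptions made for illustration. Exact dedup keeps one copy per identical content hash; fuzzy dedup compares SimHash fingerprints by Hamming distance, where a smaller distance suggests more similar documents.

```python
import hashlib


def exact_dedup(docs):
    """Keep the first occurrence of each byte-identical document."""
    seen = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha1(doc.encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


def simhash(text, bits=64):
    """Unweighted SimHash over whitespace tokens (illustrative choices, bits <= 64)."""
    counts = [0] * bits
    for token in text.split():
        # Take the first 8 bytes of MD5 as a 64-bit token hash.
        h = int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:8], "big")
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    # Bit i of the fingerprint is set when the majority of tokens had it set.
    return sum(1 << i for i in range(bits) if counts[i] > 0)


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")
```

Two documents would then be treated as near-duplicates when `hamming(simhash(a), simhash(b))` falls at or below some threshold; the threshold is the "similarity level" knob, and the right value depends on the data.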