by applgo443 969 days ago
If it's 5 Common Crawl dumps, isn't the data across multiple dumps mostly similar?
1 comment

We did an exact dedup across all 84 dumps: there are 100T tokens before exact dedup and 30T tokens after. Further fuzzy dedup (we have simhash signatures pre-computed for different similarity levels) could potentially reduce this further.

There is quite a lot of redundancy across dumps, but also a lot of unique/distinct documents.
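
To make the two stages concrete, here is a minimal Python sketch of exact dedup followed by an optional simhash-based fuzzy pass. This is not the pipeline described above. The function names, the SHA-256 key, the MD5-based token hash, the 64-bit signature width, and the `max_hamming` threshold are all assumptions chosen for illustration.

```python
import hashlib
import re


def exact_key(text: str) -> str:
    # Exact dedup key: a hash of the raw document text (assumed: SHA-256).
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def simhash(text: str, bits: int = 64) -> int:
    # Toy simhash: every token votes +1/-1 on each bit of the signature.
    counts = [0] * bits
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if counts[i] > 0)


def hamming(a: int, b: int) -> int:
    # Number of bit positions where the two signatures differ.
    return bin(a ^ b).count("1")


def dedup(docs, max_hamming=None):
    # Stage 1: drop byte-identical documents.
    # Stage 2 (only if max_hamming is set): drop documents whose signature
    # is within max_hamming bits of an already kept document.
    seen_exact = set()
    kept, kept_sigs = [], []
    for doc in docs:
        key = exact_key(doc)
        if key in seen_exact:
            continue
        seen_exact.add(key)
        if max_hamming is not None:
            sig = simhash(doc)
            # Linear scan for clarity; it is quadratic overall.
            if any(hamming(sig, s) <= max_hamming for s in kept_sigs):
                continue
            kept_sigs.append(sig)
        kept.append(doc)
    return kept
```

Calling `dedup(docs)` performs only the exact pass, while `dedup(docs, max_hamming=3)` also drops near-duplicates; a looser threshold removes more. The linear scan over kept signatures would not scale to trillions of tokens, so a real system would need some form of signature indexing. On the numbers quoted above, exact dedup alone removes about 70% of tokens (100T down to 30T).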