


	by zhangce 969 days ago
	We did an exact dedup across all 84 dumps: there are 100T tokens before this exact dedup and 30T tokens after. If we do further fuzzy dedup (we have simhash signatures pre-computed for different similarity levels), this can potentially be reduced further. There is quite a lot of redundancy across dumps, but also a lot of unique/distinct documents.
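
	For readers unfamiliar with the two passes mentioned above, here is a rough sketch of how exact dedup and simhash-based fuzzy dedup can work in principle. It is not the pipeline used for the dataset. The function names, the whitespace tokenization, the 64-bit signature size, and the `max_distance` threshold are all assumptions made for illustration, and the pairwise comparison would not scale to trillions of tokens.

```python
import hashlib


def exact_dedup(docs):
    """Keep the first occurrence of each byte-identical document."""
    seen = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


def simhash(text, bits=64):
    """Compute a simhash signature over lowercased whitespace tokens.

    Each token votes +1 or -1 on every bit position according to its
    own hash; the signature keeps the bits with a positive total.
    """
    counts = [0] * bits
    for token in text.lower().split():
        # Use the first 8 bytes of MD5 as a stable 64-bit token hash.
        h = int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:8], "big")
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    sig = 0
    for i, c in enumerate(counts):
        if c > 0:
            sig |= 1 << i
    return sig


def hamming(a, b):
    """Number of bit positions where two signatures differ."""
    return bin(a ^ b).count("1")


def fuzzy_dedup(docs, max_distance=3):
    """Drop documents whose signature is within max_distance bits of a kept one.

    max_distance is a hypothetical threshold; a smaller value means a
    stricter notion of "near duplicate". This is O(n^2) and only meant
    to show the idea.
    """
    kept, sigs = [], []
    for doc in docs:
        s = simhash(doc)
        if all(hamming(s, t) > max_distance for t in sigs):
            kept.append(doc)
            sigs.append(s)
    return kept
```

	Exact dedup only removes byte-identical copies, which is why a fuzzy pass over precomputed signatures can shrink the corpus further: documents that differ only in case, whitespace, or a few tokens end up with identical or nearby signatures.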