


	by nerdponx 1317 days ago
	Exact and near-duplicate articles should have similar or identical word frequency distributions. Maybe that can be used as a blocking criterion somehow, although it might not be any faster to compare word frequency distributions than to compare dense low-dimensional embeddings.
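
A rough Python sketch of what comparing word frequency distributions could look like. Everything here is an assumption for illustration: the crude regex tokenizer, the use of cosine similarity as the comparison, the 0.9 threshold, and the helper names `word_freq`, `cosine`, and `candidate_pairs`. It still compares every pair, so it only shows the comparison step and does not settle whether this beats comparing embeddings.

```python
import math
import re
from collections import Counter


def word_freq(text):
    """Lowercased word counts; the tokenizer is deliberately crude (assumption)."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


def candidate_pairs(docs, threshold=0.9):
    """Return index pairs (i, j), i < j, whose similarity meets the threshold.

    The threshold is an arbitrary illustrative value, not a tuned one.
    """
    freqs = [word_freq(d) for d in docs]
    pairs = []
    for i in range(len(freqs)):
        for j in range(i + 1, len(freqs)):
            if cosine(freqs[i], freqs[j]) >= threshold:
                pairs.append((i, j))
    return pairs


docs = [
    "the cat sat on the mat",
    "the cat sat on the mat today",
    "stock prices fell sharply on monday",
]
print(candidate_pairs(docs))
```

Only the pairs that survive this filter would then go on to a more expensive duplicate check. Making it a real blocking step would need some way to avoid the all-pairs loop, which this sketch does not attempt.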