by nerdponx 1317 days ago
Exact and near-duplicate articles should have similar or identical word frequency distributions. Maybe that can be used as a blocking criterion somehow, although it might not be any faster to compare word frequency distributions than to compare dense low-dimensional embeddings.
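To make the idea concrete, here is a minimal sketch of one way word frequencies might serve as a blocking criterion. Everything in it is an assumption for illustration, not something specified above: the function names (`word_freq`, `block_key`, `candidate_pairs`, `cosine`) are hypothetical, and using the top-k most frequent words as the block key is just one possible choice. It is brittle in practice, since stopwords dominate and a small edit can change the key, so a real pipeline would likely filter stopwords or use something like MinHash instead.

```python
import math
import re
from collections import Counter, defaultdict
from itertools import combinations


def word_freq(text):
    """Lowercase the text and count alphanumeric word tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def block_key(freq, k=2):
    """Hypothetical block key: the k most frequent words.

    Ties are broken alphabetically so the key is deterministic.
    """
    top = sorted(freq.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
    return tuple(sorted(word for word, _ in top))


def candidate_pairs(docs, k=2):
    """Group documents by block key and return pairs sharing a block."""
    blocks = defaultdict(list)
    for doc_id, text in docs.items():
        blocks[block_key(word_freq(text), k)].append(doc_id)
    pairs = []
    for ids in blocks.values():
        pairs.extend(combinations(sorted(ids), 2))
    return sorted(pairs)


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


docs = {
    "a": "the cat sat on the mat the cat slept",
    "b": "the cat sat on the mat and the cat slept",
    "c": "stock prices fell sharply as stock markets reacted",
}

# Only pairs that land in the same block get the more expensive comparison.
for x, y in candidate_pairs(docs):
    sim = cosine(word_freq(docs[x]), word_freq(docs[y]))
    print(x, y, round(sim, 3))
```

The point of the blocking step is that the pairwise comparison only runs inside each block rather than over all pairs. Whether that beats comparing dense embeddings directly depends on vocabulary size and how the block key is chosen, which this sketch does not try to measure.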