by nerdponx
1317 days ago
Exact and near-duplicate articles should have similar or identical word frequency distributions. Maybe that could be used as a blocking criterion somehow, i.e. a cheap key that limits which pairs get compared at all. That said, comparing word frequency distributions might not be any faster than comparing dense low-dimensional embeddings.
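A rough sketch of what that could look like, in case it helps make the idea concrete. Everything here is an assumption for illustration: the function names, the choice of "the k most frequent words" as the blocking key, and the cosine threshold are all hypothetical and untuned.

```python
from collections import Counter, defaultdict
from itertools import combinations
import math
import re


def word_counts(text):
    """Lowercase the text and count word occurrences."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def blocking_key(counts, k=3):
    """Hypothetical blocking key: the k most frequent words.

    Ties are broken alphabetically so the key is deterministic.
    """
    top = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
    return tuple(sorted(word for word, _ in top))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def candidate_duplicates(docs, k=3, threshold=0.9):
    """Compare only documents that share a blocking key.

    docs maps a document id to its text. Returns sorted id pairs whose
    word-count cosine similarity is at least `threshold` (an assumed value).
    """
    counts = {doc_id: word_counts(text) for doc_id, text in docs.items()}

    # Group document ids by blocking key.
    blocks = defaultdict(list)
    for doc_id, c in counts.items():
        blocks[blocking_key(c, k)].append(doc_id)

    # Pairwise comparison happens within each block only.
    pairs = []
    for ids in blocks.values():
        for a, b in combinations(sorted(ids), 2):
            if cosine(counts[a], counts[b]) >= threshold:
                pairs.append((a, b))
    return sorted(pairs)
```

The obvious weakness is that a top-k key is brittle: a near duplicate with a few edited sentences can land in a different block and never get compared, so in practice the key would probably need stopword removal or something coarser. It also does nothing to settle whether this beats comparing embeddings on speed.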