by mci
3112 days ago
Sounds like a fun project. However, I doubt that word vectors buy you anything more than, say, good old Nilsimsa from 2001 (https://en.wikipedia.org/wiki/Nilsimsa_Hash).
Side note: py-nilsimsa should iterate over Unicode code points instead of UTF-8 bytes. As it stands now, the similarity of any two texts in the same language using a non-Latin script is ~80 rather than ~0.
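To illustrate the side note, here is a rough sketch in Python. It is not Nilsimsa and does not use py-nilsimsa; `window_triples`, `jaccard` and `similarity` are hypothetical helpers that only borrow the idea of taking triples from a 5-item sliding window. The point is that basic Russian letters all encode to two UTF-8 bytes with a lead byte of 0xD0 or 0xD1, so byte-level triples overlap even when two words share no letters at all.

```python
from itertools import combinations


def window_triples(seq, width=5):
    """Collect every ordered 3-item combination from each sliding window.

    Assumption: this loosely imitates a windowed trigram scheme; it is
    not the actual Nilsimsa hash.
    """
    items = list(seq)
    triples = set()
    for i in range(len(items) - width + 1):
        triples.update(combinations(items[i:i + width], 3))
    return triples


def jaccard(a, b):
    """Jaccard overlap of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity(text_a, text_b, as_bytes):
    """Compare two texts over UTF-8 bytes or over Unicode code points."""
    if as_bytes:
        a, b = text_a.encode("utf-8"), text_b.encode("utf-8")
    else:
        a, b = text_a, text_b
    return jaccard(window_triples(a), window_triples(b))


# Two Russian words with no letters in common.
print(similarity("привет", "собака", as_bytes=False))  # code points
print(similarity("привет", "собака", as_bytes=True))   # UTF-8 bytes
```

Over code points the two words have nothing in common; over bytes they still share triples made purely of lead bytes, which is the kind of spurious overlap that would inflate a byte-based score.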