|
|
|
|
|
by minimaxir
399 days ago
|
|
A point of note is that the text embeddings model used here is paraphrase-multilingual-MiniLM-L12-v2 (https://huggingface.co/sentence-transformers/paraphrase-mult...), which is about 4 years old. In the NLP world, that's effectively ancient, particularly as the robustness of even small embeddings models due to global LLM improvements has increased dramatically both in information representation and distinctiveness in the embedding space. Even modern text embedding models not explicitly trained for multilingual support still do extremely well on that type of data, so they may work better for the Voynich Manuscript which is a relatively unknown language. The traditional NLP techniques of stripping suffices and POS identification may actually harm embedding quality than improvement, since that removes relevant contextual data from the global embedding. |
|
Appreciate you calling that out — that’s a great push toward iteration.