by etiennedi
1646 days ago
The models that create the vector embeddings are trained on either general or domain-specific knowledge. So, to oversimplify it a bit: the model has learned, based on the training data it was presented with, that "Scandinavian" has a relationship to "Finnish". Since the vector space is high-dimensional, you can think of each language concept as having a distinct place in that space. In this case the concepts for "Scandinavian" and "Finnish" were close enough that you got a matching result.

To simplify it even more: the vectors do not represent the words but the meaning behind them. So the two sentences "I like wine" and "The fermented juice of grapes is my favorite beverage" have zero keywords in common, but mean essentially the same thing. A good model would give them very similar vectors, even though a traditional keyword-based search engine would find no resemblance between them.

EDIT: Just realized I didn't answer the second part of your question. Yes, the models are language-specific, but there are also multilingual models that work across a large number of different languages.
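To make "close in vector space" a bit more concrete, here is a minimal sketch of ranking sentences by cosine similarity. Note the assumptions: the three-dimensional vectors in `toy_vectors` are made up by hand purely for illustration (a real model produces hundreds of dimensions), and `cosine_similarity` and `rank_by_similarity` are hypothetical helper names, not part of any particular library.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# ASSUMPTION: hand-made toy "embeddings", not the output of a real model.
# The two wine sentences are deliberately placed close together.
toy_vectors = {
    "I like wine": [0.9, 0.1, 0.2],
    "The fermented juice of grapes is my favorite beverage": [0.85, 0.15, 0.25],
    "The train leaves at noon": [0.1, 0.9, 0.3],
}


def rank_by_similarity(query, vectors):
    """Return (text, score) pairs for all other texts, most similar first."""
    q = vectors[query]
    scores = [
        (text, cosine_similarity(q, v))
        for text, v in vectors.items()
        if text != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


ranking = rank_by_similarity("I like wine", toy_vectors)
for text, score in ranking:
    print(f"{score:.3f}  {text}")
```

With these toy numbers the grape-juice sentence ranks first even though it shares no keywords with the query. In a real system the vectors would come from the embedding model instead of being typed in, but the comparison step would look much the same.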