| HN Mirror



	by etiennedi 1646 days ago
	The models that create vector embeddings are trained on either general or domain-specific knowledge. So, to oversimplify a bit: the model has learned, based on the training data it was presented with, that "Scandinavian" has a relationship to "Finnish". Since the vector space is high-dimensional, you can think of each language concept as having a distinct place in that space. In this case the concepts for "Scandinavian" and "Finnish" were close enough that you got a matching result. To simplify it even more: the vectors do not represent the words but the meaning behind them. So the two sentences "I like wine" and "The fermented juice of grapes is my favorite beverage" have zero keywords overlapping, but mean essentially the same thing. A good model would therefore give them very similar vectors, even though a traditional keyword-based search engine would find zero resemblance between them. EDIT: Just realized I didn't answer the second part of your question. Yes, the models are language-specific, but there are also multilingual models that work across a large number of different languages.
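
To make the wine example concrete, here is a minimal sketch that contrasts keyword overlap with vector similarity. The vectors are made up by hand purely for illustration (a real model produces hundreds of dimensions and different numbers), and the helper names `keyword_overlap` and `cosine_similarity` are my own, not part of any library.

```python
import math


def keyword_overlap(a: str, b: str) -> set[str]:
    """Words shared by two sentences (case-insensitive, naive whitespace split)."""
    return set(a.lower().split()) & set(b.lower().split())


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two vectors: close to 1.0 means 'same direction'."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


wine = "I like wine"
grapes = "The fermented juice of grapes is my favorite beverage"
train = "The train leaves at noon"

# ASSUMPTION: hand-picked toy vectors, not the output of a real model.
toy_vectors = {
    wine: [0.81, 0.10, 0.55, 0.05],
    grapes: [0.78, 0.15, 0.60, 0.02],
    train: [0.05, 0.90, 0.02, 0.40],
}

shared_words = keyword_overlap(wine, grapes)
sim_related = cosine_similarity(toy_vectors[wine], toy_vectors[grapes])
sim_unrelated = cosine_similarity(toy_vectors[wine], toy_vectors[train])

print("shared keywords:", shared_words)      # no words in common
print("wine vs grapes:", round(sim_related, 3))
print("wine vs train: ", round(sim_unrelated, 3))
```

A keyword engine sees nothing in common between the first two sentences, while the vector comparison scores them as near neighbors and the unrelated sentence as far away.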

1 comments

mariushn 1646 days ago

Thank you! Now I understand why others mentioned that generating vectors is the hard part.

etiennedi 1646 days ago

I agree, but at the same time it has never been easier to create great vectors. Sentence-BERT [1] by Nils Reimers is a collection of pre-trained models specifically trained to create good vectors. You can use them out of the box with Weaviate. All you have to do is select your desired model [2], and your text (or images, etc.) will be translated into vectors at import time. As I mentioned in another comment, the goal with Weaviate is to make it as easy to use as any existing search engine or database while still giving you the benefits of deep learning and vector search.

[1] Sentence-BERT: https://sbert.net

[2] Weaviate Customizer with Out-of-the-box models: https://www.semi.technology/developers/weaviate/current/gett...
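
For anyone who wants to try the pre-trained models outside of Weaviate, here is a rough sketch using the `sentence-transformers` Python package from [1]. Assumptions on my side: the model name `all-MiniLM-L6-v2` is just one commonly used checkpoint that I picked, the helpers `embed`, `rank` and `demo` are hypothetical names, and this is not a description of what Weaviate does internally at import time.

```python
def embed(texts, model_name="all-MiniLM-L6-v2"):
    """Encode texts with a pre-trained Sentence-BERT model.

    Needs `pip install sentence-transformers`; the model is downloaded on
    first use. ASSUMPTION: the model name is an arbitrary example choice.
    """
    # Imported lazily so the rest of this sketch works without the package.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    # Unit-length vectors make the dot product equal to cosine similarity.
    return model.encode(texts, normalize_embeddings=True)


def rank(query_vec, doc_vecs):
    """Return document indices ordered from most to least similar.

    Assumes all vectors are already normalized to unit length.
    """
    scores = [
        sum(float(q) * float(d) for q, d in zip(query_vec, vec))
        for vec in doc_vecs
    ]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def demo():
    """Embed a few documents and a query, then print them by similarity."""
    docs = [
        "The fermented juice of grapes is my favorite beverage",
        "The train leaves at noon",
    ]
    vectors = embed(docs + ["I like wine"])
    doc_vecs, query_vec = vectors[:-1], vectors[-1]
    for i in rank(query_vec, doc_vecs):
        print(docs[i])
```

Calling `demo()` downloads the model and prints the documents in order of similarity to the query. `rank` is a brute-force comparison against every document, which is fine for a handful of texts but is exactly the part a vector search engine replaces with an index.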