| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by throw156754228 697 days ago
	The model of a word to a vector breaks down really quickly one you introduce the context and complexity of human language. That's why we went to contextual embeddings, but even they have issues. Curious would it handle negation of trained keywords, e.g "not urgent"?

3 comments

VHRanger 697 days ago

For a lot of cases, word2vec/glove still work plenty well. It also runs much faster and lighter when doing development -- the FSE library [1] does 0.5M sentences/sec on CPU, whereas the fastest sentence_transformers [2] do something like 20k sentences/sec on a V100 (!!).

For the drawbacks:

Word embeddings are only good at similarity search style queries - stuff like paraphrasing.

Negation they'll necessarily struggle with. Since word embeddings are generally summed or averaged into a sentence embedding, a negation won't shift the sentence vector space around the way it would in a LM embedding.

Also things like homonyms are issues, but this is massively overblown as a reason to use LM embeddings (at least for latin/germanic languages).

Most people use LM embeddings because they've been told it's the best thing by other people rather than benchmarking accuracy and performance for their usecase.

1. https://github.com/oborchers/Fast_Sentence_Embeddings

2. https://www.sbert.net/docs/sentence_transformer/pretrained_m...

link

arunsupe 697 days ago

This implementation does not because the query has to be a word. One way to extend it to phrases is to average the vectors of each word in the phrase. Another way is to have a word2vec model that embeds phrases. (The large Google News model I think does have phrases, with spaces converted to underscore). But going from words to phrases opens up a whole new can of worms - how large a phrase to consider from the input stream etc. Plus, I don't think averaging words in the phrase is the same as learning the embedding for the phrase. Sentence embedding models are necessary for that, but they are far too slow for this use case as pointed out by others.

To summarize, this is a simple implementation that works for the simplest use case - semantic matching of words.

link

vunderba 697 days ago

Yep, it would almost be easier to:

- check to see if the search query contains more than one word

- spin up a more modestly sized LLM, such as mistral 7b

- ask the LLM to try to condense / find single word synonym for the user query

- send to sgrep

link

djoldman 697 days ago

Definitely a limitation of word2vec as applied here.

Something like SBERT addresses this.

link