by yorwba 827 days ago
Works for:

Bitext mining: Given a sentence in one language, find its translation in a collection of sentences in another language using the cosine similarity of embeddings.

Classification: identify the kind of text you're dealing with using logistic regression on the embeddings.

Clustering: group similar texts together using k-means clustering on the embeddings.

Pair Classification: determine whether two texts are paraphrases of each other by using a binary threshold on the cosine similarity of the embeddings.

Reranking: given a query and a list of potential results, sort relevant results ahead of irrelevant ones by sorting according to the cosine similarity of embeddings.

Etc etc.

These are MTEB benchmark tasks (https://arxiv.org/pdf/2210.07316.pdf). If you have no need for something like that, good for you; you don't need to care how well embeddings work for these tasks.
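
For the three cosine-similarity tasks above (bitext mining, pair classification, reranking), here is a rough sketch of what "using the cosine similarity of embeddings" amounts to. It is not how MTEB implements its evaluation. The function names, the 0.8 threshold, and the toy 3-d vectors are all made up for illustration; real embeddings would come from whatever model you use.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mine_bitext(query_vec, candidate_vecs):
    """Bitext mining: index of the candidate closest to the query."""
    return max(range(len(candidate_vecs)),
               key=lambda i: cosine(query_vec, candidate_vecs[i]))

def is_paraphrase(u, v, threshold=0.8):
    """Pair classification: binary threshold on the similarity.
    The 0.8 default is an arbitrary placeholder, not a recommended value."""
    return cosine(u, v) >= threshold

def rerank(query_vec, result_vecs):
    """Reranking: result indices sorted from most to least similar."""
    return sorted(range(len(result_vecs)),
                  key=lambda i: cosine(query_vec, result_vecs[i]),
                  reverse=True)

# Toy stand-ins for real embeddings (hypothetical values).
query = [1.0, 0.0, 0.0]
candidates = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [0.5, 0.5, 0.0]]

print(mine_bitext(query, candidates))           # 1
print(is_paraphrase(query, candidates[1]))      # True
print(rerank(query, candidates))                # [1, 2, 0]
```

All three tasks reduce to the same similarity function; they differ only in whether you take the argmax, apply a threshold, or sort.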

1 comment

Easy there, Firthmeister. I'm familiar with the canon. If getting some desirable behavior in your application is good enough for you, then feel free to ignore what I'm saying.
> Embeddings and vector stores wouldn't have taken off in the way that they did if they didn't actually work.
They've taken off because they have utility in information retrieval systems. They work for getting info into Google (Stanford) Knowledge Panels. I don't think it really goes any further than that. They are most useful to the few orgs that went from dominating NLP research to controlling it outright by convincing everyone that scale is the only way forward, and then owning scale. Alternatives to word embeddings aren't even considered or discussed. Embeddings are assumed as the starting point for pretty much all work in NLP today, even though they are as uninteresting now as they were when word2vec was published in 2013. They do not and will not work for language.