HN Mirror



	by _t89y 827 days ago
	Works for what? The leaderboards? BPE tiktoken tokens, BPE GPT-2 tokens, SentencePiece, GloVe, word2vec, ..., take your pick: they all end up in a latent space of arbitrary dimensionality and arbitrary vocab size where they can be mapped onto images. This is never going to work for language. The only thing the leaderboards are good for is enabling you to charge more for your model than everyone else for a month or two. The only meaning hyperparameters like dimensionality and vocab size have is in their message that more is always better and scaling up is what matters.


yorwba 827 days ago

Works for:

Bitext mining: Given a sentence in one language, find its translation in a collection of sentences in another language using the cosine similarity of embeddings.

Classification: identify the kind of text you're dealing with using logistic regression on the embeddings.

Clustering: group similar texts together using k-means clustering on the embeddings.

Pair Classification: determine whether two texts are paraphrases of each other by using a binary threshold on the cosine similarity of the embeddings.

Reranking: given a query and a list of potential results, sort relevant results ahead of irrelevant ones by sorting according to the cosine similarity of embeddings.

Etc etc.

These are MTEB benchmark tasks (https://arxiv.org/pdf/2210.07316.pdf). If you have no need for something like that, good for you; you don't need to care how well embeddings work for these tasks.
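To make the cosine-similarity tasks above concrete, here is a minimal sketch in plain Python. The 3-dimensional vectors are made up for illustration (real embeddings would come from a model and have hundreds of dimensions), the function names are hypothetical, and the 0.8 threshold is an arbitrary assumption, not a value taken from MTEB.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mine_bitext(query_vec, candidate_vecs):
    """Bitext mining: index of the candidate closest to the query."""
    return max(range(len(candidate_vecs)),
               key=lambda i: cosine(query_vec, candidate_vecs[i]))

def is_paraphrase(u, v, threshold=0.8):
    """Pair classification: binary threshold on cosine similarity.
    The default threshold is an arbitrary assumption."""
    return cosine(u, v) >= threshold

def rerank(query_vec, docs):
    """Reranking: docs is a list of (doc_id, vector) pairs;
    returns doc ids sorted from most to least similar to the query."""
    ranked = sorted(docs, key=lambda d: cosine(query_vec, d[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked]

# Toy data: made-up vectors standing in for real sentence embeddings.
query = [1.0, 0.0, 0.0]
docs = [
    ("a", [0.0, 1.0, 0.0]),  # orthogonal to the query
    ("b", [0.9, 0.1, 0.0]),  # nearly parallel to the query
    ("c", [0.5, 0.5, 0.0]),  # in between
]

print(rerank(query, docs))                        # ['b', 'c', 'a']
print(mine_bitext(query, [v for _, v in docs]))   # 1
print(is_paraphrase(query, docs[1][1]))           # True
print(is_paraphrase(query, docs[2][1]))           # False
```

All three tasks reduce to the same similarity function; only what is done with the scores differs (argmax, threshold, or sort).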


_t89y 827 days ago

Easy there, Firthmeister. I'm familiar with the canon. If getting some desirable behavior in your application is good enough for you, then feel free to ignore what I'm saying.


minimaxir 827 days ago

Embeddings and vector stores wouldn't have taken off in the way that they did if they didn't actually work.


_t89y 824 days ago

They've taken off because they have utility in information retrieval systems. They work for getting info into Google (Stanford) Knowledge Panels. I don't think it really goes any further than that. They are most useful to the few orgs that went from dominating NLP research to controlling it outright by convincing everyone scale is the only way forward and owning scale. Alternatives to word embeddings aren't even considered or discussed. Embeddings are assumed as a starting point for pretty much all work in NLP today, even though they are as uninteresting today as they were when word2vec was published in 2013. They do not and will not work for language.
