by _t89y
827 days ago
Works for what? The leaderboards? BPE tiktokens, BPE GPT-2 tokens, SentencePiece, GloVe, word2vec, ..., take your pick, they all end up in a latent space of arbitrary dimensionality and arbitrary vocab size where they can be mapped onto images. This is never going to work for language. The only thing the leaderboards are good for is enabling you to charge more for your model than everyone else for a month or two. The only meaning hyperparameters like dimensionality and vocab size have is in their message that more is always better and scaling up is what matters.
|
Bitext mining: given a sentence in one language, find its translation in a collection of sentences in another language using the cosine similarity of embeddings.
Classification: identify the kind of text you're dealing with using logistic regression on the embeddings.
Clustering: group similar texts together using k-means clustering on the embeddings.
Pair classification: determine whether two texts are paraphrases of each other by applying a binary threshold to the cosine similarity of their embeddings.
Reranking: given a query and a list of potential results, put relevant results ahead of irrelevant ones by sorting on the cosine similarity between the query embedding and each result embedding.
Etc etc.
These are MTEB benchmark tasks (https://arxiv.org/pdf/2210.07316.pdf). If you have no need for anything like that, good for you: you don't need to care how well embeddings work for these tasks.
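
To make a couple of those concrete, here is a rough sketch of pair classification and reranking on top of cosine similarity. The vectors are made-up toy values, not output from any real embedding model, and the function names and the 0.8 threshold are my own placeholders rather than anything MTEB prescribes.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def is_paraphrase(vec_a, vec_b, threshold=0.8):
    """Pair classification: binary threshold on cosine similarity.

    The threshold is an illustrative guess; in practice it would be
    tuned on labeled pairs.
    """
    return cosine_similarity(vec_a, vec_b) >= threshold


def rerank(query_vec, candidates):
    """Reranking: sort (text, vector) pairs by similarity to the query,
    most similar first."""
    return sorted(
        candidates,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )


# Toy 3-dimensional "embeddings" (hypothetical values for illustration).
query = [1.0, 0.0, 0.0]
candidates = [
    ("unrelated", [0.0, 1.0, 0.0]),
    ("close match", [0.9, 0.1, 0.0]),
    ("partial match", [0.5, 0.5, 0.0]),
]

ranked = [text for text, _ in rerank(query, candidates)]
print(ranked)
print(is_paraphrase(query, [0.9, 0.1, 0.0]))
print(is_paraphrase(query, [0.5, 0.5, 0.0]))
```

With real embeddings you would swap the toy vectors for whatever your model produces; the rest stays the same, which is more or less why these tasks get used to compare embedding models.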