Well, you are certainly correct about how cosine similarity would apply to the text embeddings, but I disagree about how useful that application is to our understanding of the model.

> In this case, cosine distance one would be in a case when it repeats word-by-word. It is not even a "similar thought" but some sort of LLM's OCD.

Observing that would be helpful in our understanding of the model!

> For anything else... cosine similarity says little. Sometimes, two steps can have opposite conclusions, but they have very high cosine similarity. In another case, it can just expand on the same solution but use different vocabulary or look from another angle.

Yes, that would be good to observe also! But here I think you undervalue the specificity of the OAI embeddings model, which has 3072 dimensions. That's quite a lot of information being captured.

> A more robust approach would be to give the whole reasoning to an LLM and ask to grade according to a given criterion (e.g. "grade insight in each step, from 1 to 5").

Totally disagree here. Using embeddings is much more reliable and robust; I wouldn't put much stock in LLM output, since there is too much going on.
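For anyone who wants to try the word-by-word repetition check, here is a minimal sketch of it. The helper names (`consecutive_similarities`, `flag_repeats`), the 0.99 cutoff, and the 3-dimensional toy vectors are all assumptions made for illustration; real step embeddings would come from whatever embedding model you use.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def consecutive_similarities(step_embeddings):
    """Similarity between each reasoning step and the one after it."""
    return [
        cosine_similarity(a, b)
        for a, b in zip(step_embeddings, step_embeddings[1:])
    ]

def flag_repeats(step_embeddings, threshold=0.99):
    """Indices of steps that are near-duplicates of the previous step.

    The 0.99 threshold is an arbitrary assumption, not a tuned value.
    """
    sims = consecutive_similarities(step_embeddings)
    return [i + 1 for i, s in enumerate(sims) if s >= threshold]

# Toy vectors standing in for real step embeddings (made up for this sketch).
steps = [
    [1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],  # identical to step 0, so it should be flagged
    [0.0, 1.0, 0.0],  # orthogonal to step 1, so it should not be flagged
]
print(flag_repeats(steps))
```

This only catches the near-duplicate case; it says nothing about whether two highly similar steps agree or disagree.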
The distance between "dairy creamer" and "non-dairy creamer" is too small. So an embedding for one will rank high for the other as well, even though they mean precisely opposite things. For example, the embedding for "dairy free creamer" ends up at a low distance from both concepts, so you cannot really apply a reasonable threshold to separate them.
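A rough sketch of the thresholding problem, using made-up 3-dimensional vectors rather than real embedding output. The vectors, the `matches_above` helper, and the 0.9 cutoff are all hypothetical; they are only meant to show the shape of the issue, where both the matching and the opposite phrase clear the same cutoff.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def matches_above(query, candidates, threshold):
    """Names of candidates whose similarity to the query meets the threshold."""
    return [
        name
        for name, vector in candidates.items()
        if cosine_similarity(query, vector) >= threshold
    ]

# Hypothetical toy vectors: the two phrases share most of their direction
# and differ only slightly along the last axis.
candidates = {
    "dairy creamer": [0.9, 0.4, 0.1],
    "non-dairy creamer": [0.9, 0.4, -0.15],
}
query = [0.9, 0.4, -0.1]  # stand-in for "dairy free creamer"

for name, vector in candidates.items():
    print(name, round(cosine_similarity(query, vector), 4))

# Both phrases pass the (assumed) 0.9 cutoff, including the opposite one.
print(matches_above(query, candidates, threshold=0.9))
```

With real embeddings the numbers will differ, but if the two scores sit this close together, any cutoff loose enough to be useful elsewhere will admit both.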