by stared 491 days ago
While I like the idea of measuring subsequent steps, this kind of approach of using embeddings is the reason why I wrote: "Don't use cosine similarity carelessly" (https://p.migdal.pl/blog/2025/01/dont-use-cosine-similarity).

In this case, a cosine similarity of one would arise only when a step repeats the previous one word for word. It is not even a "similar thought" but some sort of LLM's OCD.

For anything else... cosine similarity says little. Sometimes, two steps can have opposite conclusions but have very high cosine similarity. In another case, it can just expand on the same solution but use different vocabulary or look from another angle.
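
For reference, a minimal sketch of what the score actually measures. The 3-dimensional vectors are made up for illustration and are not real embeddings; the point is only that cosine similarity compares direction, so it reaches 1 for identical (or rescaled) vectors and by itself carries no notion of whether two steps agree.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings", invented for this example.
step = [0.2, 0.7, 0.1]
repeat = [0.2, 0.7, 0.1]     # the same step repeated verbatim
rescaled = [0.4, 1.4, 0.2]   # same direction, different length

print(cosine_similarity(step, repeat))    # approximately 1.0
print(cosine_similarity(step, rescaled))  # approximately 1.0
```

Anything below 1 only says the directions differ by some angle; what that angle means for the reasoning depends on the embedding model.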

A more robust approach would be to give the whole reasoning to an LLM and ask to grade according to a given criterion (e.g. "grade insight in each step, from 1 to 5").


> A more robust approach would be to give the whole reasoning to an LLM and ask to grade according to a given criterion

We actually use a variant of this approach in our reasoning prompts. We use structured output to force the LLM to think for 15 steps, and in each step we force it to generate a self-assessed score and then make a decision as to whether it wants to CONTINUE, ADJUST, or BACKTRACK.

  - Evaluate quality with reward scores (0.0-1.0)
  - Guide next steps based on rewards:
    • 0.8+ → CONTINUE current approach
    • 0.5-0.7 → ADJUST with minor changes
    • Below 0.5 → BACKTRACK and try different approach
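
The thresholds above could be sketched as a small routing function. This is an illustration, not the commenter's actual code: `Decision` and `next_action` are hypothetical names, and because the list leaves scores between 0.7 and 0.8 unspecified, the sketch assumes they fall under ADJUST.

```python
from enum import Enum

class Decision(Enum):
    CONTINUE = "CONTINUE"
    ADJUST = "ADJUST"
    BACKTRACK = "BACKTRACK"

def next_action(reward: float) -> Decision:
    """Map a self-assessed reward in [0.0, 1.0] to the next action."""
    if not 0.0 <= reward <= 1.0:
        raise ValueError(f"reward must be in [0.0, 1.0], got {reward}")
    if reward >= 0.8:
        return Decision.CONTINUE
    # Assumption: the 0.7-0.8 gap in the list is treated as ADJUST.
    if reward >= 0.5:
        return Decision.ADJUST
    return Decision.BACKTRACK

print(next_action(0.85).value)  # CONTINUE
print(next_action(0.6).value)   # ADJUST
print(next_action(0.3).value)   # BACKTRACK
```
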

I go into a bit more depth about it here, with an explicit example of its thinking at the end: https://bits.logic.inc/p/the-eagles-will-win-super-bowl-lix

Every time I see these kinds of prompts that ask an LLM for a numeric ranking, I'm very skeptical that the numbers really mean anything to the model. How does it know what a 0.5 is supposed to be? With humans, you'd have them grade things and then correct the grades so they learn what it is from experience. But unless you specifically fine-tune your LLM, this wouldn't apply.

I went through this with gemini-1.5, using it to evaluate responses. Almost everything was graded 8-9/10. To get useful results I did the following:
  1. Created a long few-shot prompt with many examples of human-graded results.
  2. Prompted it to write its review before its assessment.
  3. Prompted it to include example quotes to justify its assessment.
  4. Finally, had it produce a numeric score.
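
A rough sketch of how such a prompt might be assembled, with the review and quotes requested before the score. Everything here is an assumption for illustration: `GradedExample`, `build_grading_prompt`, the field labels, and the instruction wording are hypothetical, and the human-graded examples would have to come from your own data.

```python
from typing import NamedTuple

class GradedExample(NamedTuple):
    response: str
    review: str
    quotes: list[str]
    score: int  # human-assigned, 1 to 10

# Hypothetical instruction text; the order (review, quotes, score) is the point.
INSTRUCTIONS = (
    "Grade the response below. Answer in this order:\n"
    "1. REVIEW: a short written review.\n"
    "2. QUOTES: exact quotes from the response that justify the review.\n"
    "3. SCORE: an integer from 1 to 10.\n"
)

def format_example(ex: GradedExample) -> str:
    """Render one human-graded example in the same order the model must follow."""
    quotes = "\n".join(f'- "{q}"' for q in ex.quotes)
    return (
        f"RESPONSE:\n{ex.response}\n"
        f"REVIEW: {ex.review}\n"
        f"QUOTES:\n{quotes}\n"
        f"SCORE: {ex.score}\n"
    )

def build_grading_prompt(response: str, examples: list[GradedExample]) -> str:
    """Few-shot prompt that ends where the model should start its review."""
    shots = "\n".join(format_example(ex) for ex in examples)
    return f"{INSTRUCTIONS}\n{shots}\nRESPONSE:\n{response}\nREVIEW:"
```

Ending the prompt at `REVIEW:` nudges the model to write its justification first, so the score is produced only after the review and quotes are already in context.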

With gemini-2 I've been able to get similar results without the few-shot prompts, simply by prompting it not to be a sycophant, explaining why it was important to get realistic, even hard scores, and saying that I expected most scores to be low in order for the high-scoring content to stand out.

In a recent test, I changed to using word scores: low, medium, high, and very high. Out of about 500 examples, none scored very high. I thought that was pretty cool, as when I do find one scoring high it will stand out and hopefully justify its score.

Yes, you are right.

If we ask an LLM to grade something, we must create a prompt with good instructions. Otherwise, we will have no idea what 0.5 means or whether it is given consistently.

(A rule of thumb: Is it likely that various people, not knowing the context of a given task, will give the same grade?)

The most robust approach is to ask to rank things within a task. That is, "given blog post titles, grade them according to (criteria)" rather than asking about each title separately.
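
A small sketch of the ranking variant, assuming the model is asked to reply with item numbers only. `build_ranking_prompt` and `parse_ranking` are hypothetical helpers, and the reply format is an assumption; the parser rejects replies that skip or repeat an item.

```python
def build_ranking_prompt(items: list[str], criterion: str) -> str:
    """Put all items in one prompt so they are judged against each other."""
    numbered = "\n".join(f"{i}. {item}" for i, item in enumerate(items, start=1))
    return (
        f"Rank the following items from best to worst according to: {criterion}.\n"
        "Reply with the item numbers only, comma-separated.\n\n"
        f"{numbered}"
    )

def parse_ranking(reply: str, n_items: int) -> list[int]:
    """Parse a reply such as '2, 3, 1' into a list of 1-based item numbers."""
    order = [int(token) for token in reply.replace(",", " ").split()]
    if sorted(order) != list(range(1, n_items + 1)):
        raise ValueError("reply must list every item number exactly once")
    return order

print(parse_ranking("2, 3, 1", 3))  # [2, 3, 1]
```
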

Well, you are certainly correct about how cosine sim would apply to the text embeddings, but I disagree about how useful that application is to our understanding of the model.

> In this case, a cosine similarity of one would arise only when a step repeats the previous one word for word. It is not even a "similar thought" but some sort of LLM's OCD.

Observing that would be helpful in our understanding of the model!

> For anything else... cosine similarity says little. Sometimes, two steps can have opposite conclusions but have very high cosine similarity. In another case, it can just expand on the same solution but use different vocabulary or look from another angle.

Yes, that would be good to observe also! But here I think you undervalue the specificity of the OAI embeddings model, which has 3072 dimensions. That's quite a lot of information being captured.

> A more robust approach would be to give the whole reasoning to an LLM and ask to grade according to a given criterion (e.g. "grade insight in each step, from 1 to 5").

Totally disagree here, using embeddings is much more reliable / robust, I wouldn't put much stock in LLM output, too much going on

Here is a simple example of a problem my team ran across.

The distance between "dairy creamer" and "non-dairy creamer" is too small. So an embedding for one will rank high for the other as well, even though they mean precisely opposite things. For example, the embedding for "dairy free creamer" will result in a low distance from both of the concepts such that you cannot really apply a reasonable threshold.

But in a larger frame, of "things tightly associated with coffee", they mean something extremely close. Whether these things are opposite from each other, or virtually identical, is a function of your point of view; or, in this context, the generally-meaningful level of discourse.

At scale, I expect that a very small distance between dairy and non-dairy is the more accurate representation of intent.

Of course, I also expect them to be very close and that's the problem with purely relying on embeddings and distance where, in this case, the two things mean entirely opposite preferences on the same topic.

(I think this may be why we sometimes see AI-generated search overviews give certain types of really bad answers: the underlying embedding search is returning "semantically similar" results.)

> Totally disagree here, using embeddings is much more reliable / robust, I wouldn't put much stock in LLM output, too much going on

I think both ways can be the preferable option, depending on how well the embedding space represents the text, and that mostly depends on the specific use case and model combination.

So if the embedding space does not correctly project required nuance, then it's often a viable option to get the top_n results and do the rest by utilizing the llm + validation calls.
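
A minimal sketch of that two-stage idea: shortlist by cosine similarity, then let a second check filter the shortlist. All names are hypothetical, the 2-dimensional vectors are invented toy values rather than real embeddings, and `is_valid` is a plain callable standing in for an LLM validation call.

```python
import math
from typing import Callable

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve_then_validate(
    query_vec: list[float],
    candidates: list[tuple[str, list[float]]],
    top_n: int,
    is_valid: Callable[[str], bool],
) -> list[str]:
    """Take the top_n candidates by similarity, then keep those the validator accepts."""
    ranked = sorted(
        candidates,
        key=lambda c: cosine_similarity(query_vec, c[1]),
        reverse=True,
    )
    shortlist = [text for text, _ in ranked[:top_n]]
    return [text for text in shortlist if is_valid(text)]

# Toy vectors, invented so that the two creamers sit close together.
candidates = [
    ("dairy creamer", [0.90, 0.10]),
    ("non-dairy creamer", [0.88, 0.12]),
    ("motor oil", [0.10, 0.90]),
]
query = [0.89, 0.11]  # stands in for the embedding of "dairy free creamer"

# Stub validator; in practice this would be an LLM call checking the nuance.
wants_non_dairy = lambda text: text.startswith("non-dairy")

print(retrieve_then_validate(query, candidates, top_n=2, is_valid=wants_non_dairy))
```

The embedding stage narrows the field cheaply, and the validator handles the distinction that distance alone cannot separate.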

But I do agree with you, I would always prefer to work with embeddings rather than some LLM output. I think it would be such a great thing to have a rock-solid embedding space where one would not even consider looking at token predictor models.