by autokad 674 days ago
If I understand correctly, they project the LLM's internal activations onto meaningful linear directions derived from contrasting examples. I guess this is similar to how we began to derive a lot more value from embeddings by using the embedding values for various things.

yes that's correct! we project an evaluator LLM's internal activations onto meaningful linear directions derived from contrasting examples. the strongest connections are to LLM interpretability (the existence of meaningful linear directions) and steering research (computing directions from contrast pairs). This has been done with base model activations to understand base model behavior, but we show you can boost evaluation accuracy this way too, with a small amount of human feedback.
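
To make the idea concrete, here is a minimal sketch of one common way to get a direction from contrast pairs: take the difference of the mean activations of the two groups, normalize it, and score a new activation by its dot product with that direction. This is an illustration under assumptions, not necessarily the exact method discussed above. The activation vectors are made-up toy numbers, and the names `contrast_direction` and `project` are hypothetical.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def contrast_direction(positive, negative):
    """Unit vector pointing from the mean negative activation
    to the mean positive activation (difference of means)."""
    pos_mean = mean_vector(positive)
    neg_mean = mean_vector(negative)
    diff = [p - q for p, q in zip(pos_mean, neg_mean)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("the two groups have identical mean activations")
    return [d / norm for d in diff]

def project(activation, direction):
    """Scalar projection of one activation onto the unit direction."""
    return sum(a * d for a, d in zip(activation, direction))

# Toy stand-ins for evaluator activations (assumed data, 3 dimensions):
# one group recorded on "good" responses, one on "bad" responses.
good_acts = [[1.0, 0.2, 0.0], [0.8, 0.1, 0.1]]
bad_acts = [[-1.0, 0.2, 0.0], [-0.8, 0.1, 0.1]]

direction = contrast_direction(good_acts, bad_acts)

# Higher score means the new activation lies further toward the "good" side.
score_good = project([0.9, 0.0, 0.3], direction)
score_bad = project([-0.7, 0.4, 0.3], direction)
```

In a real setup the vectors would be hidden states read from a chosen layer of the evaluator model, and the handful of human-labeled examples would decide which responses go into each group.
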
This is pretty interesting work, and I am curious what will emerge from this kind of thing in the future. I have been working on something 'similar' that tackles the same problem: how do you evaluate a response without a source of truth? This gets at that.

I was looking at it from the standpoint of the embeddings of the output at different temperatures.
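
One possible reading of that idea, sketched under assumptions: sample the same prompt at several temperatures, embed each output, and treat the mean pairwise cosine similarity as a rough agreement signal. The embeddings below are made-up toy vectors, `agreement` is a hypothetical helper name, and whether high agreement actually tracks correctness is an open question rather than something this snippet shows.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def agreement(embeddings):
    """Mean pairwise cosine similarity over a set of output embeddings."""
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        raise ValueError("need at least two embeddings")
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

# Toy embeddings of outputs sampled at three temperatures (assumed data).
stable_outputs = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.1]]
unstable_outputs = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.2]]

stable_score = agreement(stable_outputs)
unstable_score = agreement(unstable_outputs)
```
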