by ruby314 674 days ago
Yes, that's correct! We project an evaluator LLM's internal activations onto meaningful linear directions derived from contrasting examples. The strongest connections are to LLM interpretability (the existence of meaningful linear directions) and steering research (computing directions from contrast pairs). This has been done with base model activations to understand base model behavior, but we show you can boost evaluation accuracy this way too, with a small amount of human feedback.
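
For readers who want something concrete, here is a minimal sketch of the general idea of projecting activations onto a direction built from contrast pairs. It is not the actual implementation: the difference-of-means direction, the midpoint threshold, the function names (`contrast_direction`, `project`, `is_good`), and the synthetic "activations" are all assumptions chosen for illustration. In practice the arrays would hold hidden states read out of the evaluator LLM.

```python
import numpy as np

def contrast_direction(pos_acts, neg_acts):
    """Unit vector pointing from the mean 'bad' activation to the mean 'good' one.

    Difference of means is one common way to build a direction from contrast
    pairs; it is an assumption here, not necessarily the method used above.
    """
    diff = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def project(acts, direction):
    """Scalar projection of each activation vector onto the direction."""
    return acts @ direction

# Synthetic stand-in for evaluator activations (hypothetical data):
# 'good' and 'bad' examples differ mainly along one hidden axis.
rng = np.random.default_rng(0)
dim = 16
hidden_axis = np.zeros(dim)
hidden_axis[0] = 1.0
good = rng.normal(size=(20, dim)) + 3.0 * hidden_axis
bad = rng.normal(size=(20, dim)) - 3.0 * hidden_axis

direction = contrast_direction(good, bad)

# Midpoint between the two class means along the direction (an assumed,
# deliberately simple decision rule).
threshold = 0.5 * (project(good, direction).mean() + project(bad, direction).mean())

def is_good(act):
    """Label a new activation by which side of the threshold it falls on."""
    return bool(project(act, direction) > threshold)
```

The small set of labeled contrast examples plays the role of the human feedback: it is only used to pick the direction and the threshold, and new responses are then scored by a single dot product.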
1 comment

This is pretty interesting work; I'm curious what emerges from this kind of thing in the future. I have been working on something 'similar' that tackles the same problem: how do you evaluate a response without a source of truth? This gets at that.

I was looking at it from the standpoint of the embeddings of outputs generated at different temperatures.