| HN Mirror



	by ruby314 674 days ago
	Yes, that's correct! We project an evaluator LLM's internal activations onto meaningful linear directions derived from contrasting examples. The strongest connections are to LLM interpretability (the existence of meaningful linear directions) and steering research (computing those directions from contrast pairs). This has been done with base model activations to understand base model behavior, but we show that you can also boost evaluation accuracy this way, using only a small amount of human feedback.
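
For readers who want the idea in concrete terms, here is a minimal sketch of projecting an activation onto a direction derived from contrast pairs. It assumes the direction is the normalized difference of class means, which is one common choice in steering work and not necessarily the method described above. The function names and the toy 3-dimensional vectors are made up for illustration; real activations would come from a hidden layer of the evaluator model.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def contrast_direction(positive, negative):
    """Unit vector pointing from the mean 'negative' activation to the mean 'positive' one.

    Assumption: difference of means stands in for whatever direction-finding
    procedure is actually used.
    """
    mu_pos = mean_vector(positive)
    mu_neg = mean_vector(negative)
    diff = [p - q for p, q in zip(mu_pos, mu_neg)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0:
        raise ValueError("contrast sets have identical means")
    return [d / norm for d in diff]

def project(activation, direction):
    """Scalar projection of an activation onto a unit direction."""
    return sum(a * d for a, d in zip(activation, direction))

# Toy activations (hypothetical): examples a human marked good vs. bad.
good = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
bad = [[-1.0, 0.0, 2.0], [-3.0, 0.0, 2.0]]

direction = contrast_direction(good, bad)
score = project([0.5, 1.0, 2.0], direction)  # larger means closer to the "good" side
```

In this sketch the projection is a single scalar per response, so a threshold on it could be fit from a handful of labeled examples.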

1 comment

autokad 673 days ago

This is pretty interesting work, and I'm curious what emerges from this kind of thing in the future. I have been working on something 'similar' that involves the same problem: how do you evaluate a response without a source of truth? This gets at that.

I was looking at it from the standpoint of the embeddings of the output at different temperatures.
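
One way to read the temperature idea, sketched under heavy assumptions: sample the same prompt at several temperatures, embed each output, and treat the mean pairwise cosine similarity as a rough stability signal. The comment does not say which metric or embedding model was used, so the function names and the toy 2-dimensional vectors below are purely hypothetical.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

def mean_pairwise_similarity(embeddings):
    """Average cosine similarity over all pairs of output embeddings."""
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        raise ValueError("need at least two embeddings")
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

# Toy embeddings (hypothetical): one row per output sampled at a different temperature.
stable = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]      # outputs agree in direction
scattered = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # outputs drift apart

stable_score = mean_pairwise_similarity(stable)
scattered_score = mean_pairwise_similarity(scattered)
```

A lower score would suggest the answer changes a lot as temperature varies, which might flag a response as less reliable even without a ground-truth label.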
