by ruby314
674 days ago
Yes, that's correct! We project an evaluator LLM's internal activations onto meaningful linear directions derived from contrasting examples. The strongest connections are to LLM interpretability (the existence of meaningful linear directions) and steering research (computing directions from contrast pairs). This has been done with base model activations to understand base model behavior, but we show you can boost evaluation accuracy this way too, with a small amount of human feedback.
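For anyone who wants the idea in code, here is a rough sketch, not our actual implementation. It assumes the direction is a simple difference of class means over contrast examples, which is one common choice in steering work, and it uses synthetic vectors in place of real evaluator activations. The names `contrast_direction`, `project`, and `fake_activation` are made up for illustration.

```python
import math
import random


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def contrast_direction(pos_acts, neg_acts):
    """Unit vector pointing from the mean 'negative' activation to the
    mean 'positive' one (difference of means, an assumed choice)."""
    mu_pos = mean_vector(pos_acts)
    mu_neg = mean_vector(neg_acts)
    diff = [p - n for p, n in zip(mu_pos, mu_neg)]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]


def project(activation, direction):
    """Scalar projection of one activation vector onto the direction."""
    return sum(a * d for a, d in zip(activation, direction))


# Synthetic stand-in for evaluator activations: small Gaussian noise plus
# a shift along coordinate 0 that depends on the label (hypothetical data).
rng = random.Random(0)


def fake_activation(label, dim=8):
    v = [rng.gauss(0, 0.1) for _ in range(dim)]
    v[0] += 1.0 if label else -1.0
    return v


# A small set of labeled contrast examples (the "human feedback").
pos_acts = [fake_activation(True) for _ in range(20)]
neg_acts = [fake_activation(False) for _ in range(20)]
direction = contrast_direction(pos_acts, neg_acts)

# Score unseen examples by projecting onto the learned direction.
score_good = project(fake_activation(True), direction)
score_bad = project(fake_activation(False), direction)
```

With real models you would replace `fake_activation` with hidden states read from a chosen layer of the evaluator, and threshold or calibrate the projection score using the labeled examples.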
|
I was looking at it from the standpoint of the embeddings of the output at different temperatures.