by eric_gu
674 days ago
This is cool! I have not read up on evaluation techniques that use LLM-as-a-Judge, so I hadn't heard of the term "evaluator LLM" before. Questions that came to mind:
- How are you deciding on the positive/negative concept pairs to generate your latent "evaluation direction"?
- What layer of activations on the evaluator model do you use? The output layer?
- What base model are you using for solving the HaluEval task?
- I notice that LSR on Lynx 8B actually does worse than naive completion, and is in fact worse than LSR on Llama-3-8B-Instruct. Why do you think that is?
|
- We generate synthetic contrast pairs (for this post, using gpt-4o) and post-process them for quality. The impact of different types of contrast pairs is a continuing area of research for us.
- We treat the choice of evaluator model layers as a hyperparameter, similar to the steering research (some of which we cite in our “non-comprehensive list of references”). We also see that the middle layers tend to be most effective.
- For the base model, we use both Llama-3-8b-Instruct and Llama-3.1-8b-Instruct to show LSR taking advantage of the improved base model (maybe I misunderstood the question?)
- Re: Lynx being worse with LSR, it depends on the data source. It's worse for HaluEval, but you can see in the PubMedQA table that it’s slightly better there. That’s consistent with the analysis in Contrastive Activation Addition https://arxiv.org/pdf/2312.06681 (section 6): sometimes the effects of fine-tuning and latent space steering are cumulative, sometimes the opposite. Would love to know if anyone has seen research as to why.
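
To make the contrast-pair and layer-as-hyperparameter points above concrete, here is a rough sketch. It is not the actual LSR implementation, which this thread does not spell out. It assumes a difference-of-means direction (one common choice in the steering literature) and uses synthetic NumPy arrays in place of real evaluator activations. The function names, the `STRENGTHS` values, and the "signal peaks in the middle layer" setup are all hypothetical, chosen only to mirror the comment about middle layers.

```python
import numpy as np

def evaluation_direction(pos, neg):
    """Unit vector pointing from the mean negative to the mean positive activation."""
    diff = pos.mean(axis=0) - neg.mean(axis=0)
    return diff / np.linalg.norm(diff)

def fit_readout(pos, neg):
    """Fit a direction and a midpoint threshold on contrast-pair activations."""
    d = evaluation_direction(pos, neg)
    threshold = 0.5 * ((pos @ d).mean() + (neg @ d).mean())
    return d, threshold

def accuracy(pos, neg, d, threshold):
    """Fraction of held-out examples landing on the correct side of the threshold."""
    correct = np.sum(pos @ d > threshold) + np.sum(neg @ d <= threshold)
    return correct / (len(pos) + len(neg))

def synthetic_layer(rng, strength, n=200, dim=64):
    """Stand-in for one layer's activations (assumption: Gaussian noise plus a
    class-dependent shift of size `strength` along a random unit vector)."""
    u = rng.normal(size=dim)
    u /= np.linalg.norm(u)
    pos = rng.normal(size=(n, dim)) + 0.5 * strength * u
    neg = rng.normal(size=(n, dim)) - 0.5 * strength * u
    return pos, neg

def select_layer(strengths, seed=0):
    """Treat the layer index as a hyperparameter: fit on half of the pairs,
    score on the other half, and keep the best-scoring layer."""
    rng = np.random.default_rng(seed)
    results = {}
    for layer, strength in enumerate(strengths):
        pos, neg = synthetic_layer(rng, strength)
        half = len(pos) // 2
        d, t = fit_readout(pos[:half], neg[:half])
        results[layer] = accuracy(pos[half:], neg[half:], d, t)
    best = max(results, key=results.get)
    return best, results

# Hypothetical per-layer signal strengths, strongest in the middle.
STRENGTHS = [0.0, 1.0, 2.0, 4.0, 2.0, 1.0]
best_layer, val_acc = select_layer(STRENGTHS)
print(best_layer, val_acc)
```

With a real model, `synthetic_layer` would be replaced by hidden states collected at each layer for the positive and negative sides of each contrast pair; the selection loop stays the same.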