by ruby314 674 days ago
Thanks for sharing the reference. It's definitely related! But the approach in that paper is different in a couple of very significant ways:

1. They train a classifier on top of the activations of the base LLM, so you need access to the activations of the LLM you are using for the task. We use the activations of an open evaluator LLM (e.g. Llama-3.1-8B-Instruct).

2. Our features are projections: a few numbers that derive their predictive power from the linear directions we project onto. That lets us fit our classifier with tens of samples, instead of the thousands it takes to fit a feedforward neural network on top of the high-dimensional activation space.

Would be interesting to compare the accuracy of the two approaches on the same benchmark, but our approach is broadly applicable even when you aren't working with an open LLM, and doesn't require a ton of training data
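To make point 2 concrete, here is a minimal sketch of the general idea, not our actual code. It assumes a difference-of-means direction computed from contrast pairs and a one-feature logistic regression; the "activations" are synthetic Gaussian vectors standing in for real evaluator-LLM hidden states, and every function name and parameter here (`contrast_direction`, `fit_logistic_1d`, `run_demo`, `shift`, and so on) is made up for illustration.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def mean_vector(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def unit(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def contrast_direction(pos_acts, neg_acts):
    # Assumption: the direction is the normalized difference of class means.
    mp, mn = mean_vector(pos_acts), mean_vector(neg_acts)
    return unit([p - n for p, n in zip(mp, mn)])


def sigmoid(z):
    # Numerically stable form for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic_1d(xs, ys, lr=0.1, steps=500):
    # Plain gradient descent on a single feature: two parameters in total.
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errs = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        w -= lr * sum(e * x for e, x in zip(errs, xs)) / n
        b -= lr * sum(errs) / n
    return w, b


def fake_activation(rng, hidden_dir, label, shift=3.0):
    # Synthetic stand-in for a hidden state: noise plus a label-dependent shift.
    sign = 1.0 if label == 1 else -1.0
    return [rng.gauss(0.0, 1.0) + sign * shift * h for h in hidden_dir]


def run_demo(seed=0, dim=64, n_pairs=20, n_train=20, n_test=200):
    rng = random.Random(seed)
    hidden_dir = unit([rng.gauss(0.0, 1.0) for _ in range(dim)])
    pos = [fake_activation(rng, hidden_dir, 1) for _ in range(n_pairs)]
    neg = [fake_activation(rng, hidden_dir, 0) for _ in range(n_pairs)]
    direction = contrast_direction(pos, neg)

    # Each labeled sample is reduced to one number: its projection.
    train_y = [i % 2 for i in range(n_train)]
    train_x = [dot(fake_activation(rng, hidden_dir, y), direction) for y in train_y]
    w, b = fit_logistic_1d(train_x, train_y)

    test_y = [i % 2 for i in range(n_test)]
    test_x = [dot(fake_activation(rng, hidden_dir, y), direction) for y in test_y]
    hits = sum((sigmoid(w * x + b) >= 0.5) == (y == 1) for x, y in zip(test_x, test_y))
    return hits / n_test


if __name__ == "__main__":
    print(f"held-out accuracy on synthetic data: {run_demo():.2f}")
```

The classifier only has to learn a weight and a bias, which is why tens of labeled samples can be enough in a setup like this; the direction itself carries most of the information.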


I didn't mean to suggest it as a competitor to the method presented (LSR: latent space readout). It is old, after all. LSR's use of an evaluator LLM and its ability to work with small samples (because it works with linear directions) do seem novel and useful to me.

An advantage of being aware of early papers is that it accumulates citations so you can often find good works in reverse citations. I had a brief look and the following seems interesting:

GRATH: Gradual Self-Truthifying for Large Language Models https://arxiv.org/abs/2401.12292

TruthX: Alleviating Hallucinations by Editing Large Language Models in Truthful Space https://arxiv.org/abs/2402.17811

In contrast, Lynx may be a better model and HaluBench may be a better benchmark, but the paper is too new, so it has zero reverse citations on Google Scholar at the moment. Interestingly, the Lynx paper does cite Azaria 2023, although in a very cursory way.

"An advantage of being aware of early papers is that it accumulates citations so you can often find good works in reverse citations." - absolutely.

Thanks again for sharing interesting references. Cool that GRATH uses contrast pairs in an iterative process with DPO, and that TruthX is steering (using the term broadly) with a creative architecture to determine the inference-time edits.

One thing about Lynx and HaluBench: as we understand it, HaluBench is the test set for Lynx's training data. They do have a couple of held-out data sources besides the four they train with, but as far as we could tell from their paper, they use the same hallucination-inducing function. Be curious to hear your thoughts on that.

and yeah, we fully acknowledge that our list of references is not comprehensive :) thanks again for sharing!

This is cool! I have not read up on evaluation techniques that use LLM-as-a-Judge, so I hadn't heard of the term "evaluator LLM" before.

Questions that came to mind:

- How are you deciding on the positive/negative concept pairs to generate your latent "evaluation direction"?

- What layer of activations on the evaluator model do you use—the output layer?

- What base model are you using for solving the HaluEval task?

- I notice that LSR on Lynx 8B actually does worse than naive completion, and is in fact worse than LSR on Llama-3-8B-Instruct. Why do you think that is?

Ty!

- We generate contrast pairs (for this post, using gpt-4o) and do some post processing for quality (synthetic data). The impact of different types of contrast pairs is a continuing area of research for us.

- We treat the evaluator model layers as hyperparameters, similar to the steering research (some of which we cite in our “non-comprehensive list of references”). We also see that the middle layers tend to be most effective.

- For the base model, we use both Llama-3-8B-Instruct and Llama-3.1-8B-Instruct to show LSR taking advantage of the improved base model (maybe I misunderstood the question?)

- Re: Lynx being worse with LSR, it depends on the data source. It's worse for HaluEval, but you can see in the PubMedQA table that it’s slightly better there. That’s consistent with the analysis in Contrastive Activation Addition https://arxiv.org/pdf/2312.06681 (section 6: sometimes the impacts of fine-tuning and latent space steering are cumulative, sometimes the opposite). Would love to know if anyone has seen research as to why.
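For anyone wondering what "layers as hyperparameters" could look like in practice, here is a rough sketch under stated assumptions, not our pipeline. Each layer gets its own difference-of-means direction from the contrast pairs, a midpoint threshold on the projection, and a score on a small validation set; the layer with the best score wins. The activations are synthetic, and the per-layer `strengths` are invented so that the middle layer carries the most signal, purely to mirror the observation above. All names (`layer_score`, `sweep_layers`, `fake_layer_acts`) are hypothetical.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def mean_vector(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def unit(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def layer_score(pos_acts, neg_acts, val_acts, val_labels):
    # Direction from contrast pairs, threshold halfway between class means.
    mp, mn = mean_vector(pos_acts), mean_vector(neg_acts)
    direction = unit([p - n for p, n in zip(mp, mn)])
    threshold = (dot(mp, direction) + dot(mn, direction)) / 2.0
    hits = sum(
        (dot(a, direction) > threshold) == (y == 1)
        for a, y in zip(val_acts, val_labels)
    )
    return hits / len(val_labels)


def fake_layer_acts(rng, hidden_dir, label, strength):
    # Synthetic stand-in for one layer's hidden state.
    sign = 1.0 if label == 1 else -1.0
    return [rng.gauss(0.0, 1.0) + sign * strength * h for h in hidden_dir]


def sweep_layers(seed=0, dim=32, n_pairs=20, n_val=100,
                 strengths=(0.2, 1.0, 3.0, 1.0, 0.2)):
    # strengths is an invented per-layer signal level, one entry per layer.
    rng = random.Random(seed)
    scores = []
    for strength in strengths:
        hidden_dir = unit([rng.gauss(0.0, 1.0) for _ in range(dim)])
        pos = [fake_layer_acts(rng, hidden_dir, 1, strength) for _ in range(n_pairs)]
        neg = [fake_layer_acts(rng, hidden_dir, 0, strength) for _ in range(n_pairs)]
        val_labels = [i % 2 for i in range(n_val)]
        val_acts = [fake_layer_acts(rng, hidden_dir, y, strength) for y in val_labels]
        scores.append(layer_score(pos, neg, val_acts, val_labels))
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


if __name__ == "__main__":
    best, scores = sweep_layers()
    print("best layer index:", best)
    print("validation accuracy per layer:", scores)
```

With real models, the per-layer activations would come from the evaluator LLM's hidden states rather than from `fake_layer_acts`, and the validation set should be kept separate from whatever you report final numbers on.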

Thanks for the reply!

Re question #3: I'm not sure I understand why you need to vary the base model or how doing so would allow LSR to take advantage? Isn't your LSR technique used on the activations of the evaluator model?

As a note of feedback, I found the original article a bit hard to understand even with multiple reads. I would have really benefited from a traditional "methodology" section like in an ML paper! The graphs upfront don't make sense to someone who isn't familiar with the problem setting, and even now I'm not sure if the x-axis in the HaluEval Benchmark bar chart refers to the base model or the evaluator model. Maybe it's just me.

Re #3 - my bad, I mixed terminology in my answer above. It’s the “base model” for the evaluator model (vs. a fine-tuned evaluator model). We just use the labeled HaluBench dataset as the outputs to be evaluated, so there is no base model for the HaluEval task.

Thanks for the feedback, really helpful. We may edit for clarity.

Ah understood. Makes sense now!