by ruby314
674 days ago
Thanks for sharing the reference. It's definitely related! But the approach in that paper differs in a couple of significant ways:
1. They train a classifier on top of the activations of the base LLM, so you need access to the activations of the LLM you are using for the task. We use the activations of an open evaluator LLM (e.g. Llama-3.1-8B-Instruct).
2. Our features are projections: a few numbers that derive their predictive power from the linear directions we project onto. That lets us fit our classifier with tens of samples, instead of the thousands needed to fit a feedforward neural network on top of the high-dimensional activation space.
It would be interesting to compare the accuracy of the two approaches on the same benchmark, but our approach is broadly applicable even when you aren't working with an open LLM, and it doesn't require a ton of training data.
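To make point 2 concrete, here is a minimal sketch on synthetic data. It is not the actual pipeline: random vectors stand in for evaluator activations, the three directions are random rather than learned, and the helper names (`project`, `fit_logistic`, `predict`, `make_data`) and all sizes are assumptions chosen for illustration.

```python
import numpy as np

def project(activations, directions):
    """Reduce each activation vector to one scalar per direction."""
    unit = directions / np.linalg.norm(directions, axis=1, keepdims=True)
    return activations @ unit.T

def add_bias(features):
    """Append a constant column so the classifier can learn an offset."""
    return np.hstack([features, np.ones((len(features), 1))])

def fit_logistic(features, labels, lr=0.1, steps=1000):
    """Logistic regression fitted by plain gradient descent."""
    x = add_bias(features)
    w = np.zeros(x.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x @ w)))
        w -= lr * x.T @ (p - labels) / len(labels)
    return w

def predict(features, w):
    return (add_bias(features) @ w > 0).astype(int)

def make_data(n, directions, rng, shift=3.0):
    """Synthetic stand-in: Gaussian noise plus a label-dependent shift
    along the first direction (an assumption, not real activations)."""
    labels = rng.integers(0, 2, size=n)
    unit0 = directions[0] / np.linalg.norm(directions[0])
    noise = rng.normal(size=(n, directions.shape[1]))
    signal = (2 * labels - 1)[:, None] * shift * unit0
    return noise + signal, labels

rng = np.random.default_rng(0)
directions = rng.normal(size=(3, 512))              # hypothetical directions
train_x, train_y = make_data(40, directions, rng)   # tens of samples
test_x, test_y = make_data(200, directions, rng)

# The classifier only ever sees 3 numbers per sample, not 512.
w = fit_logistic(project(train_x, directions), train_y)
test_acc = float(np.mean(predict(project(test_x, directions), w) == test_y))
print(f"held-out accuracy: {test_acc:.2f}")
```

In this toy setup the classifier has only four parameters (three weights and a bias), which is why 40 labeled samples are enough. A network fitted directly on the 512-dimensional vectors would have far more parameters to estimate.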
|
An advantage of being aware of early papers is that they accumulate citations, so you can often find good work through reverse citations. I had a brief look and the following seem interesting:
GRATH: Gradual Self-Truthifying for Large Language Models https://arxiv.org/abs/2401.12292
TruthX: Alleviating Hallucinations by Editing Large Language Models in Truthful Space https://arxiv.org/abs/2402.17811
In contrast, Lynx may be a better model and HaluBench may be a better benchmark, but the paper is too new, so it has zero reverse citations on Google Scholar at the moment. Interestingly, the Lynx paper does cite Azaria 2023, although only in a very cursory way.