
Additionally, I get the impression you might be suspecting research malpractice. Obviously I don't have insider knowledge, but I would note that training probes on the middle layers of the model wasn't this paper's idea. It cites other papers that already did exactly that. The contribution of this paper is simply that focusing on the middle layers at certain "critical tokens" gives a better signal than just checking the middle layers on every token.
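To make that distinction concrete, here is a toy sketch in Python. It is not the paper's code and uses no real model: the "activations" are synthetic noise, and I simply assume that only one hypothetical token position (`critical_idx`) carries a truthfulness signal. All names and numbers are made up for illustration. It trains one linear probe on the critical token only and another on every token, then compares held-out accuracy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for middle-layer activations: (examples, tokens, hidden_dim).
n_examples, n_tokens, hidden_dim = 400, 8, 16
critical_idx = 3  # hypothetical position of the "critical token"

labels = rng.integers(0, 2, size=n_examples)  # 1 = true statement, 0 = false
acts = rng.normal(size=(n_examples, n_tokens, hidden_dim))

# Assumption: only the critical token encodes the label, along one direction.
direction = rng.normal(size=hidden_dim)
direction /= np.linalg.norm(direction)
acts[:, critical_idx, :] += 3.0 * (2 * labels[:, None] - 1) * direction


def fit_probe(X, y, steps=500, lr=0.1):
    """Fit a logistic-regression probe with plain gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        grad = p - y
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b


def accuracy(w, b, X, y):
    """Fraction of rows where the probe's sign matches the label."""
    preds = ((X @ w + b) > 0).astype(int)
    return float((preds == y).mean())


train, test = slice(0, 300), slice(300, None)

# Probe A: read only the critical token of each example.
w_c, b_c = fit_probe(acts[train, critical_idx, :], labels[train])
acc_critical = accuracy(w_c, b_c, acts[test, critical_idx, :], labels[test])

# Probe B: treat every token as its own sample, each with the example's label.
X_all_train = acts[train].reshape(-1, hidden_dim)
y_all_train = np.repeat(labels[train], n_tokens)
X_all_test = acts[test].reshape(-1, hidden_dim)
y_all_test = np.repeat(labels[test], n_tokens)
w_a, b_a = fit_probe(X_all_train, y_all_train)
acc_all = accuracy(w_a, b_a, X_all_test, y_all_test)

print("critical-token probe accuracy:", acc_critical)
print("every-token probe accuracy:", acc_all)
```

In this toy setup the every-token probe is diluted by positions that carry no signal, which is the intuition behind the claim. Whether real models behave this way is exactly what the paper has to show empirically.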

It's of course possible that this paper in particular is fraudulent, but note that there is a whole field of research making the same basic claim as this paper, so this isn't some one-off thing. A fair number of people from different institutions would need to be in on it for the entire field to be fraudulent.

Alternatively, I think you may be objecting to the use of the word "truthfulness" in the abstract of the paper, because you seem to think that only human thoughts can possibly have a true or false value. I'm not actually going to object to the idea that only human thoughts can be true or false, but, as in the response I wrote to your koan comment, the user can interpret the LLM's output, and that interpretation is a thought of the user's that has a true or false value.

In this case, philosophically, you can think of this paper as trying to find cases where the LLM outputs strings that the user interprets as false. I think the authors of the paper are probably treating truth and falsity more as properties of sentences, and thus as something mere strings can possess regardless of how they are created. This is also a philosophically valid way to look at it, but it differs from your view in a way that possibly made their claims seem absurd to you.