Hacker News new | ask | show | jobs
by TeMPOraL 342 days ago
The point is, there is no hard boundary. The LLM too may know[0] following instructions in data isn't part of transcription task, and still decide to do it.

--

[0] - In fact I bet it does, in the sense that, doing something like Anthropic did[1], you could observe relevant concepts being activated within the model. This is similar to how it turned out the model is usually aware when it doesn't know the answer to a question.

[1] - https://www.anthropic.com/news/tracing-thoughts-language-mod...

1 comments

If you can measure that in a reliable way then things are fine. Mixup prevented.

If you just ask, the human is not likely to lie but who knows with the LLM.