by durumu 638 days ago
I think there's a useful distinction between plausible-seeming text that is wrong in some subtle way and text that is completely fabricated to match a superficial output format, and the latter is what I wish people used "hallucination" to mean. A clear example is when you ask an LLM for some sources, with ISBNs, and it just makes up random titles and ISBNs that it knows full well do not correspond to reality. If you ask "Did you just make that up?" the LLM will respond with something like "Sorry, yes, I made that up, I actually just remembered I can't cite direct sources." I wonder if this is because RLHF teaches the LLM that humans in practice prefer properly formatted fake output over truthful refusals?
2 comments

How does a model "know full well" that it output a fake ISBN?

It's been trained that sources look like plausible titles plus random numbers.

It's been trained that when challenged it should say "oh sorry I can't do this."

Are those things actually distinct?

In fairness, they will also admit they were wrong even if they were right.