| HN Mirror

They can, and they already do it somewhat. We've found enough to know that.

As the most well known example: Anthropic examined their AIs and found that they have a "name recognition" pathway - i.e. when asked about biographic facts, the AI will respond with "I don't know" if "name recognition" has failed.

This pathway is present even in base models, but only results in consistent "I don't know" if AI was trained for reduced hallucinations.

AIs are also capable of recognizing their own uncertainity. If you have an AI-generated list of historic facts that includes hallucinated ones, you can feed that list back to the same AI and ask it about how certain it is about every fact listed. Hallucinated entries will consistently have less certainty. This latent "recognize uncertainty" capability can, once again, be used in anti-hallucination training.

Those anti-hallucination capabilities are fragile, easy to damage in training, and do not fully generalize.

Can't help but think that limited "self-awareness" - and I mean that in a very mechanical, no-nonsense "has information about its own capabilities" way - is a major cause of hallucinations. An AI has some awareness of its own capabilities and how certain it is about things - but not nearly enough of it to avoid hallucinations consistently across different domains and settings.