
Even where the training data does say "I don't know" (which it usually doesn't: people don't tend to comment or publish books, etc. when they don't think they know), that text reflects the author's knowledge rather than the model's, so it would be off in both directions: the author may not know things the model does, and vice versa.

One could imagine a fine-tuning procedure that gives a model better knowledge of itself: test it, and on prompts where its most probable completions are wrong, fine-tune it to say "I don't know" instead. Though the 'are wrong' is doing some really heavy lifting, since it wouldn't be simple to judge that without a better model that knows the right answers.
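
To make the idea concrete, here is a rough sketch of just the data-building step, not a real training pipeline. Everything named here is made up for illustration: `build_idk_finetune_set`, `toy_model`, and the exact-match check are assumptions, and the `reference_answers` dict stands in for the "better model that knows the right answers", which is exactly the hard part.

```python
from typing import Callable, Dict, List

IDK = "I don't know"


def build_idk_finetune_set(
    prompts: List[str],
    reference_answers: Dict[str, str],
    most_probable_completion: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Collect fine-tuning examples for prompts the model gets wrong.

    Hypothetical helper: `reference_answers` plays the role of an oracle,
    and correctness is judged by a naive case-insensitive exact match.
    """
    examples = []
    for prompt in prompts:
        completion = most_probable_completion(prompt)
        expected = reference_answers[prompt]
        # The "are wrong" step: trivial here, hard in practice.
        is_wrong = completion.strip().lower() != expected.strip().lower()
        if is_wrong:
            # Replace the wrong answer with an explicit admission.
            examples.append({"prompt": prompt, "target": IDK})
    return examples


def toy_model(prompt: str) -> str:
    """Stand-in for a real model's most probable completion (assumption)."""
    canned = {
        "Capital of France?": "Paris",
        "Capital of Australia?": "Sydney",  # deliberately wrong
    }
    return canned[prompt]


reference = {
    "Capital of France?": "Paris",
    "Capital of Australia?": "Canberra",
}
dataset = build_idk_finetune_set(list(reference), reference, toy_model)
print(dataset)
```

Only the prompt the toy model gets wrong ends up in `dataset`, paired with the "I don't know" target. In a real setting the exact-match check would need to be replaced by something much more robust, which is where the heavy lifting comes in.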