by matusp
180 days ago
AI alignment is not a solved problem by any means. As long as LLMs hallucinate, they cannot be considered aligned. You can only be aligned if you have a zero probability of generating hallucinations. The two problems, alignment and hallucinations, can be considered equivalent.
|
Alignment is, approximately, two questions. The first is "are we even training this AI on the correct utility function?" The second is "even if we specified the correct utility function, did the AI learn a representation of that function, or some weird approximation of it with edge cases we haven't figured out how to spot?"
With RLHF, for example, the first is "is optimising for thumbs-up/thumbs-down the right objective at all?" and the second is "did it learn the preference, or just how to game the reward?"
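
To make the thumbs-up example concrete, here is a toy sketch in Python. Everything in it is made up for illustration: the `Response` fields, the `true_utility` and `proxy_reward` functions, and the weights are assumptions, not how any real RLHF pipeline scores outputs. It only shows how picking the answer with the highest proxy reward can diverge from picking the answer we actually wanted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Response:
    text: str
    correct: bool    # what we actually care about
    confident: bool  # what a hurried rater tends to reward


def true_utility(r: Response) -> float:
    # Hypothetical intended objective: only correctness matters.
    return 1.0 if r.correct else 0.0


def proxy_reward(r: Response) -> float:
    # Hypothetical thumbs-up model (assumed weights): raters mostly reward
    # a confident tone, with a smaller bonus when they notice correctness.
    return (1.0 if r.confident else 0.0) + (0.5 if r.correct else 0.0)


# Toy candidate set: the model has no confident *and* correct answer available.
candidates = [
    Response("It is definitely X.", correct=False, confident=True),
    Response("I'm not sure, but it may be Y.", correct=True, confident=False),
]

# "Training" here is just argmax over candidates under each objective.
best_by_proxy = max(candidates, key=proxy_reward)
best_by_true = max(candidates, key=true_utility)

print("proxy picks:", best_by_proxy.text, "| true utility:", true_utility(best_by_proxy))
print("true picks: ", best_by_true.text, "| true utility:", true_utility(best_by_true))
```

In this toy setup the proxy prefers the confident wrong answer, which is the "game the reward" failure. The first question (is thumbs-up the right objective at all?) corresponds to whether `proxy_reward` should have been the target in the first place.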