| HN Mirror

A human who hates maths is different from one who adds up wrong because they think the first digit counts units, second digit how many tens, third digit how many twenties (as one of my uni lecturers recounted of her own childhood).

Alignment is, approximately, "are we even training this AI on the correct utility function?" followed up by the second question "even if we specified the correct utility function, did the AI learn a representation of that function or some weird approximation of that function with edge cases we've not figured out how to spot?"

With, e.g. RLHF, the first is "is optimising for thumbs-up/thumbs-down the right objective at all?", the second is "did it learn the preference, or just how to game the reward?"