what's wild is they accidentally solved it — pretraining IS unsupervised learning at scale, RLHF IS reinforcement learning. they just didnt know the recipe yet
What would unsupervised mean, would unsupervised be something like alphago playing against itself trillions of times?
Whereas self-supervised, allows learning without explicit annotation of data ; but it doesn't matter if the models already trained on the entire Internet, and it's not like a game where it can come up with effectively new training data for itself?
Unsupervised is basically clustering. Alphago is RL - winning or losing a game is a form of supervision.
Unsupervised is something where there is no intrinsic reward signal. In pre training, predicting the next token and seeing that it matches is a reward signal, hence it is self supervised.
fair point — OpenAI's original plan literally said "solve unsupervised learning". the self-supervised distinction wasnt really standard til after BERT/GPT popularized it