Hacker News new | ask | show | jobs
by toxik 572 days ago
Why do you think that? InstructGPT was predominantly trained as a next-token predictor on whatever soup of data OpenAI curated at the time. The alignment signal (both RL part and the supervised prompt/answer pairs) are a tiny bit of the gradient.