| HN Mirror



	by valec 705 days ago
	It does. That's what the "direct preference" part of DPO means. You skip training an explicit reward model on the preference data, which is what RLHF does, and instead directly optimize the log probability of preferred vs. dispreferred responses.
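
	For anyone who wants to see what "directly optimize" looks like, here is a rough sketch of the per-pair DPO loss in plain Python. It assumes you already have the summed log probabilities of each response under the model being trained and under a frozen reference model; the function name, argument names, and the `beta` default are made up for illustration and are not taken from any particular library.

```python
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * margin)).

    All four inputs are summed token log probabilities of a full response.
    Names and the beta default are illustrative assumptions.
    """
    # How much more (or less) likely each response is under the policy
    # than under the frozen reference model.
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected

    # Positive margin means the policy favors the preferred response
    # more strongly than the reference does.
    margin = beta * (chosen_logratio - rejected_logratio)

    # -log(sigmoid(x)) == log(1 + exp(-x)); branch to avoid overflow.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

	No reward model appears anywhere: the only inputs are log probabilities from the two language models, and the loss drops as the policy shifts probability toward the preferred response relative to the reference.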


meroes 704 days ago

What is it called when humans interact with a model through lengthy exchanges (mostly humans correcting the model’s responses to a question posed to it, mostly through chat, and labeling each of the model’s statements as correct or not), and then all of that text (possibly with some editing) is used to train another, higher-level model?

Does this have a specific name?


dartos 704 days ago

I don’t think that process has a specific name. It’s just how training these models works.

Conversations you have with something like ChatGPT are likely stored, then sorted through somehow, then added to an ever-growing dataset of conversations that would be used to train entirely new models.
