by LesZedCB 1186 days ago

isn't that pretty much what they are doing anyway?

my understanding was that RLHF uses human feedback to train a separate reward model, which is then used to further fine-tune the original model. I could have misunderstood, though.

https://huggingface.co/blog/rlhf#reward-model-training
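
to make that concrete, here's a toy sketch of the two stages as I understand them. Everything in it is made up for illustration: the two-number feature vectors (`[helpfulness, rudeness]`), the linear reward, and the preference pairs are all hypothetical. Real RLHF uses a neural reward model and an RL algorithm such as PPO for the second stage; I've swapped in a simple "pick the highest-scoring candidate" step as a stand-in for that, so this is not the actual method from the linked post.

```python
import math

def reward(w, x):
    """Linear reward: dot product of weights and (hypothetical) response features."""
    return sum(wi * xi for wi, xi in zip(w, x))

def train_reward_model(pairs, dim, lr=0.1, epochs=200):
    """Fit w on (chosen, rejected) pairs with the pairwise loss
    -log(sigmoid(reward(chosen) - reward(rejected)))."""
    w = [0.0] * dim
    for _ in range(epochs):
        for chosen, rejected in pairs:
            margin = reward(w, chosen) - reward(w, rejected)
            # Negative gradient of the loss w.r.t. the margin: 1 - sigmoid(margin)
            g = 1.0 - 1.0 / (1.0 + math.exp(-margin))
            for i in range(dim):
                w[i] += lr * g * (chosen[i] - rejected[i])
    return w

# Stage 1: human feedback as preference pairs (assumed toy data).
# Each response is [helpfulness, rudeness]; humans preferred the first of each pair.
human_preferences = [
    ([0.9, 0.1], [0.2, 0.8]),
    ([0.7, 0.0], [0.6, 0.9]),
    ([0.8, 0.2], [0.3, 0.3]),
]
w = train_reward_model(human_preferences, dim=2)

# Stage 2 (stand-in for the RL step): score new candidate outputs with the
# learned reward and keep the best one, with no human in the loop.
candidates = {
    "polite answer": [0.8, 0.1],
    "rude answer": [0.8, 0.9],
}
best = max(candidates, key=lambda name: reward(w, candidates[name]))
print(best)
```

the point is just that humans only label the preference pairs; after that, the learned reward does the scoring.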