My understanding was that RLHF uses human feedback to train a separate model (the reward model), which then scores the original model's outputs so the original model can be fine-tuned further. I could have misunderstood, though.
https://huggingface.co/blog/rlhf#reward-model-training
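
For what it's worth, here is a rough sketch of how I picture the reward-model step. It is a toy, not what the linked post implements. The two-number "features" per response, the linear reward function, and the pairwise loss `-log(sigmoid(r_chosen - r_rejected))` are all assumptions I made for illustration (that loss is a common choice for preference data, but the real thing is a fine-tuned language model, not a two-weight linear model).

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def reward(weights, features):
    # Hypothetical linear reward model: one scalar score per response.
    return sum(w * f for w, f in zip(weights, features))

def pairwise_loss(weights, chosen, rejected):
    # Low when the human-preferred response scores higher than the rejected one.
    margin = reward(weights, chosen) - reward(weights, rejected)
    return -math.log(sigmoid(margin))

def train_reward_model(pairs, dim, lr=0.1, epochs=200):
    weights = [0.0] * dim
    for _ in range(epochs):
        for chosen, rejected in pairs:
            margin = reward(weights, chosen) - reward(weights, rejected)
            # Gradient of -log(sigmoid(margin)) with respect to the margin.
            grad_scale = -(1.0 - sigmoid(margin))
            for i in range(dim):
                weights[i] -= lr * grad_scale * (chosen[i] - rejected[i])
    return weights

# Made-up preference data: (features of preferred response, features of rejected response).
pairs = [
    ([1.0, 0.0], [0.0, 1.0]),
    ([0.8, 0.1], [0.2, 0.9]),
]
weights = train_reward_model(pairs, dim=2)
print(reward(weights, [1.0, 0.0]), reward(weights, [0.0, 1.0]))
```

The second stage, where the original model is updated with reinforcement learning (the post uses PPO) to produce outputs this reward model scores highly, is not shown here.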