by zackangelo
552 days ago
Where did you see that? I thought they used an 8b model for their reward model?

> To guide our search strategies, we used RLHFlow/Llama3.1-8B-PRM-Deepseek-Data, an 8B reward model that has been trained using process supervision
See https://github.com/huggingface/search-and-learn/blob/b3375f8... and https://github.com/huggingface/search-and-learn/blob/b3375f8...
In the original paper, they use PaLM 2-S* as the "solver" and a fine-tune of it as the "verifier".