by danielmarkbruce
621 days ago
There is a lot of nonsense in here, for example:

> but we know that synthetic datasets make for poor training data

This is a silly generalization. Just google "synthetic data for training LLMs" and you'll find a bunch of papers on it. Here's a decent survey: https://arxiv.org/pdf/2404.07503

It's very likely o1 used synthetic data to train the model and/or the reward model they used for RLHF. Why do you think they don't output the chains...? They literally tell you - competitive reasons.

arXiv is free, pick up some papers. Good deep learning texts are free, pick some up.
Training a model on synthetic data (obviously) increases bias present in the initial dataset[1], making for poor training data.
IIRC (this subject is a little fuzzy for me), using synthetic data for RLHF is equivalent to just using DPO, so if they did RLHF it probably wasn’t with synthetic data. They may have gone with DPO, though.
[1] https://arxiv.org/html/2403.07857v1