Hacker News
by pizza 1158 days ago
RLHF seems to suggest that human feedback, used to tune the model after pretraining on plain text, is quite potent per sample. There might be some optimal ratio of (data + model size) to RLHF size that works quite favorably for us in getting hallucinations to a minimum. Furthermore, there might be some “there” there in the hallucinations that has yet to be identified as valuable in itself. Either way, it seems like our ability to wrangle these models is getting better.