by ianand 443 days ago
> LLMs that haven't gone through RL are useless to users. They are very unreliable, and will frequently go off the rails spewing garbage, going into repetition loops, etc... RL learning involves training the models on entire responses, not token-by-token loss (1).

Yes. For those who want a visual explanation, I have a video where I walk through this process, including what some of the training examples look like: https://www.youtube.com/watch?v=DE6WpzsSvgU&t=320s
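For anyone who prefers code to video, here is a rough toy sketch of the distinction in the quote. It is not taken from the video or from any real training stack: the function names, the probabilities, and the choice of a REINFORCE-style objective are all my own assumptions for illustration. Real RL pipelines add baselines, clipping, KL penalties and so on; this only shows where the learning signal attaches.

```python
import math

def token_level_loss(token_probs):
    """Supervised next-token loss (hypothetical helper).

    token_probs: probability the model assigned to each reference token.
    Every token has its own target, so the loss is averaged per token.
    """
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def response_level_loss(sampled_token_probs, reward):
    """REINFORCE-style surrogate loss (hypothetical helper, assumed objective).

    sampled_token_probs: probability the model assigned to each token
    of a response it sampled itself.
    reward: ONE scalar score for the entire response.
    The reward scales the summed log-probability of the whole sequence.
    """
    sequence_log_prob = sum(math.log(p) for p in sampled_token_probs)
    return -reward * sequence_log_prob

# Made-up per-token probabilities for a three-token response.
probs = [0.5, 0.25, 0.5]

per_token = token_level_loss(probs)                    # signal per token
liked = response_level_loss(probs, reward=1.0)         # whole response rewarded
disliked = response_level_loss(probs, reward=-1.0)     # whole response penalized

print(per_token, liked, disliked)
```

Minimizing the second loss makes the whole response more likely when the reward is positive and less likely when it is negative, regardless of which individual tokens were "right".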