| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by vjerancrnjak 453 days ago

Same can be achieved without RL. There’s no need to generate a full response to provide loss for learning.

Similarly, instead of waiting for whole output, loss can be decomposed over output so that partial emits have instant loss feedback.

RL, on the other hand, is allowing for more data. Instead of training on the happy path, you can deviate and measure loss for unseen examples.

But even then, you can avoid RL, put the model into a wrong position and make it learn how to recover from that position. It might be something that’s done with <thinking>, where you can provide wrong thinking as part of the output and correct answer as the other part, avoiding RL.

These are all old pre NN tricks that allow you to get a bit more data and improve the ML model.