| HN Mirror

It's not RL, but you can get a long way with a thorough system prompt to encourage it to engage in 'thinking' behavior on its own without extra training. Just playing with it myself now with promising results - Mistral Small seems very receptive to this approach (not all models are - cough, Llama).

Update: This is such a prompt: https://gist.github.com/peterc/955d797ee35b3c777d76a2d881d2f...