| HN Mirror

> as part of/after the main training

I take this to mean during weight updates, e.g. training.

> "runtime train of thought"

I take runtime here to mean inference, not during RL. What does runtime mean to you?

Previous approaches [0] successfully used inference time chain of thought to improve model responses. That has nothing to do with RL though.

The grandparent is wrong about the paper. They are doing chain of thought responses during training and doing RL on that to update the weights, not just during inference/runtime.

[0] https://arxiv.org/abs/2201.11903