| Some other follow-up reflections:

1. I wish the Y-axes were logit instead of linear, to make power-law scaling easier to see on these 0->1 measures. In this case, over the 20% -> 80% range, it doesn't really matter, but for other papers (e.g. [2] below) it would show the power-law behavior much better.

2. The power-law behavior of inference compute seems to be showing up in multiple ways now: in ensembles [1,2], and now in o1 as well. If o1's gains come purely from decoding more self-reflection tokens, that scaling has a "limit" of sorts, since it can only go as far as the context length. I think this implies (and I am betting) that relying more on multiple parallel decodings is more scalable (when you have a better critic / evaluator).

For now, instead of assuming they're doing any ensembling like top-k or self-critic + retries, a single rollout with an increasing token count does seem to roughly match all the numbers, so that's my best bet. I hypothesize we'd see continued improvement (in the same power-law sort of way, fundamentally along an x-axis of "FLOPs") if we combined these longer CoT responses with some ensemble strategy for parallel decoding and then some critic/voting/choice step. That has the benefit of increasing FLOPs (which I believe is the axis the inference power law follows) while not necessarily increasing latency.

[1] https://arxiv.org/abs/2402.05120
[2] https://arxiv.org/abs/2407.21787 |
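
To make the logit-axis point concrete, here is a rough sketch rather than anything taken from [1] or [2]. It assumes (my simplification) that each parallel sample is correct independently with the same probability, and the 5% per-sample rate is a made-up number. `logit` and `coverage` are hypothetical helper names. It prints coverage for a growing number of samples next to its logit, which keeps changing visibly as coverage approaches 1, where a linear axis compresses everything.

```python
import math


def logit(p: float) -> float:
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def coverage(p_single: float, k: int) -> float:
    """Chance that at least one of k samples is correct.

    Assumes the k samples are independent and identically accurate,
    which is a simplification made for this sketch.
    """
    return 1.0 - (1.0 - p_single) ** k


if __name__ == "__main__":
    p_single = 0.05  # hypothetical per-sample success rate
    # k is kept small enough that coverage stays strictly below 1.0
    # in floating point, so logit() is well defined.
    for k in (1, 4, 16, 64):
        c = coverage(p_single, k)
        print(f"k={k:3d}  coverage={c:.4f}  logit={logit(c):+.3f}")
```

If you are plotting with matplotlib, `ax.set_yscale("logit")` applies the same transform to the axis directly.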