| HN Mirror



	by hooloovoo_zoo 78 days ago
	One-sentence summary: We fine-tuned a general-purpose model to produce valid benchmark code results and it got better at producing benchmark code results; we didn't bother to evaluate it on anything the model used to be good at.


andy_xor_andrew 78 days ago

Not really? If you read it, there is no validation, no correctness signal, no verification, none of that. They're just passing in benchmark inputs, collecting the outputs (regardless of their quality), training on those outputs, and then sweeping the decode settings (temp, topk) of the resulting model. Their conclusion is that this results in a better model than the original - even when taking into consideration the same temp/topk sweep of the original.

So no, they are not fine-tuning a general purpose model to produce "valid benchmark code results."
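For anyone unsure what the "temp/topk" sweep is actually varying, here is a minimal sketch of how the two decode settings reshape a next-token distribution. The function name `decode_distribution` and the logits are made up for illustration; this is not the paper's code.

```python
import math

def decode_distribution(logits, temperature=1.0, top_k=None):
    """Return token probabilities after temperature scaling and top-k truncation."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    if top_k is not None:
        # Keep the k largest logits (ties at the cutoff are also kept);
        # everything else gets zero probability.
        cutoff = sorted(scaled, reverse=True)[top_k - 1]
        scaled = [x if x >= cutoff else float("-inf") for x in scaled]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical logits for a four-token vocabulary.
logits = [2.0, 1.0, 0.0, -1.0]

# A tiny "sweep": print the distribution for each (temperature, top_k) pair.
for temperature in (0.5, 1.0, 2.0):
    for top_k in (2, None):
        probs = decode_distribution(logits, temperature, top_k)
        print(temperature, top_k, [round(p, 3) for p in probs])
```

Lower temperature sharpens the distribution, higher temperature flattens it, and top-k zeroes out everything past the k most likely tokens before renormalizing.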


fpgaminer 78 days ago

Not only that, they additionally ran an experiment with the training temperature turned way up (2.0) and truncation turned off such that the majority of SFT examples were incoherent (63% IIRC). Yet the model finetuned on these broken examples still improved over baseline.
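A rough illustration of why temperature 2.0 with no truncation tends to produce garbage: on a long-tailed vocabulary, doubling the temperature moves a lot of probability onto unlikely tokens. The Zipf-like logits below are an assumption picked for the demo, not anything measured from the paper's model, and the sketch does not attempt to reproduce the 63% figure.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax over logits divided by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    return [w / total for w in weights]

def tail_mass(probs, head_size):
    """Probability of sampling a token outside the head_size most likely ones."""
    return 1.0 - sum(sorted(probs, reverse=True)[:head_size])

# Toy vocabulary (assumption): the token at rank r has logit -2 * ln(r),
# so at temperature 1 its probability is proportional to 1 / r**2.
VOCAB_SIZE = 1000
toy_logits = [-2.0 * math.log(rank) for rank in range(1, VOCAB_SIZE + 1)]

for temperature in (1.0, 2.0):
    probs = softmax_with_temperature(toy_logits, temperature)
    print(f"T={temperature}: mass outside top 10 = {tail_mass(probs, 10):.3f}")
```

With truncation (top-k or similar) that tail would be cut off before sampling; with it disabled, every step has a real chance of drawing from the tail.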


krackers 77 days ago

Maybe this vaguely still makes sense in some way, because there is actually some useful signal purely in the model "internalizing" the behavior of its own sampler.

I don't know enough to say anything more formal, but it feels like exposing the model to its own output might help it "learn" to work with the sampler to get to a goal. I know this is partly why RL is helpful: aside from shifting the output towards a specific reward (RLVR or RLHF), it's also the only place where things are optimized at the level of an actual end-to-end sampled sequence of tokens, instead of at the next-token logits level as in pretraining. (That gap is why the highest-probability suffix completion isn't necessarily what you get from greedily picking the highest logit at each step.)
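The last point, that greedy per-token choices need not give the most probable whole sequence, is easy to see on a toy example. The two-step "model" below is entirely made up: `FIRST` holds the probabilities of the first token and `SECOND` the probabilities of the second token given the first.

```python
from itertools import product

# Hypothetical two-step model (all numbers invented for illustration).
FIRST = {"A": 0.6, "B": 0.4}
SECOND = {
    "A": {"x": 0.55, "y": 0.45},
    "B": {"x": 0.9, "y": 0.1},
}

def sequence_prob(first, second):
    """Joint probability of the two-token sequence (first, second)."""
    return FIRST[first] * SECOND[first][second]

def greedy_decode():
    """Pick the most likely token at each step, one step at a time."""
    first = max(FIRST, key=FIRST.get)
    second = max(SECOND[first], key=SECOND[first].get)
    return (first, second)

def best_sequence():
    """Exhaustively search for the sequence with the highest joint probability."""
    candidates = product(FIRST, ("x", "y"))
    return max(candidates, key=lambda seq: sequence_prob(*seq))

greedy = greedy_decode()
best = best_sequence()
print("greedy:", greedy, sequence_prob(*greedy))
print("best:  ", best, sequence_prob(*best))
```

Greedy commits to "A" because it wins the first step, but the sequence starting with "B" has the higher joint probability. Exhaustive search only works here because the toy space has four sequences; real decoders cannot enumerate like this.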


hooloovoo_zoo 78 days ago

They are training the model to 1. Produce code (as opposed to answer a question, write a poem, etc.) 2. Produce long enough output to be a valid solution. So they are doing exactly what I said. Cheers.


mememememememo 78 days ago

In layman's terms, they are putting wet tyres on when it is raining and saying the car performs better over the next lap?
