HN Mirror


by nyrikki 490 days ago

While interesting, the title is obviously a bit misleading.

> Our results on a temporally held-out test set of questions resolving after December 25, 2024 show that for both of the models that we employed our method on, Phi-4 14B [15] and DeepSeek-R1 14B [14], we find accuracy improvements of between 7–10% over the base versions of these models as well as the same models fine-tuned with randomized outcome labels as a control

So 7–10% improvement for small models like DeepSeek-R1-Distill-Qwen-14B and Phi-4-14B, approaching GPT-4o.

It would be interesting to see whether the same holds for DeepSeek-R1-Distill-Qwen-32B, which in my experience is far superior to DeepSeek-R1-Distill-Qwen-14B in almost every way, yet is still runnable without datacenter-class GPUs.

The ridge plots of Brier scores are probably a good hint as to whether your application can benefit, based on its tail dependence.
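For anyone unfamiliar with the metric, here is a minimal sketch of a Brier score for binary questions. The forecasts and outcomes are made-up illustration values, not numbers from the paper, and `brier_score` is a hypothetical helper name. Looking at the per-question errors alongside the mean is one rough way to see how heavy the tail is.

```python
def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    if len(probs) != len(outcomes):
        raise ValueError("probs and outcomes must have the same length")
    if not probs:
        raise ValueError("need at least one forecast")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


# Hypothetical forecasts (probability the question resolves "yes")
# and how each question actually resolved (1 = yes, 0 = no).
forecasts = [0.9, 0.2, 0.5, 0.99]
resolved = [1, 0, 1, 0]

# Per-question squared errors: the last one is a confident miss,
# which is the kind of tail event a ridge plot makes visible.
per_question = [(p - o) ** 2 for p, o in zip(forecasts, resolved)]
worst_case = max(per_question)

mean_score = brier_score(forecasts, resolved)
# Always answering 0.5 scores exactly 0.25, a common reference point.
coin_flip_score = brier_score([0.5] * len(resolved), resolved)

print(f"mean Brier score: {mean_score:.4f}")
print(f"worst single question: {worst_case:.4f}")
print(f"always-0.5 baseline: {coin_flip_score:.4f}")
```

Lower is better: 0 is a perfect forecast, and a single confident miss can dominate the mean, which matters if your application is sensitive to worst cases rather than averages.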

IMHO this paper is all about making small models work better, and nothing in it says anything about frontier models or LLMs in general.

1 comment

bturtel 490 days ago

We're working on a follow-up paper now to show similar results with larger models!
