| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by danielhanchen 380 days ago

Oh hey :) Thanks for the kind words - we did provide benchmarks (MMLU, KLD, Perplexity) for Llama 4 Scout, Gemma 3 27B using our methodology - https://docs.unsloth.ai/basics/unsloth-dynamic-2.0-ggufs and https://x.com/UnslothAI/status/1915476692786962441

For R1 specifically, we did an internal benchmark on the original model - https://unsloth.ai/blog/deepseekr1-dynamic

For R1-0528 specifically on evals - we're still running them :)) It's quite expensive to run, so we first do "vibe check" on some internal test cases, and they do pretty well!

But we generally stress the bug fixes that we do, which objectively increase performance by +1 to sometimes +10% accuracy - for example Llama 4 bug fixes, Gemma bug fixes - https://news.ycombinator.com/item?id=39671146 etc are much more important :)

We also provide Q8_0 and Q8_K_XL quants, which are mostly equivalent to FP8 - you can also use the magical `-ot ".ffn_.*_exps.=CPU"` incantation to offload MoE layers to RAM!

1 comments

saurik 379 days ago

> All Distilled and the original R1 versions seem to have accidentally assigned the padding token to <｜endofsentence｜>, which is mostly not a good idea, especially if you want to further finetune on top of these reasoning models. This will cause endless infinite generations, since most frameworks will mask the EOS token out as -100.

I couldn't tell if this was an error in the code running the model or in the model weights themselves; if/assuming the former, are these fixes being upstreamed to anywhere?

link