
by wesleyyue 638 days ago

Interesting observations:

* Llama 3.2 multimodal actually still ranks below Molmo from AI2, which was released this morning.

* AI2D: 92.3 (Llama 3.2 90B) vs 96.3 (Molmo 72B)

* Llama 3.2 1B and 3B are pruned from 3.1 8B, so no leapfrogging, unlike 3 -> 3.1.

* Notably no code benchmarks. Deliberate exclusion of code data in distillation to maximize mobile on-device use cases?

Was hoping there would be some interesting models I could add to https://double.bot, but there don't seem to be any improvements to frontier performance on coding.

2 comments

daemonologist 638 days ago

On the second point, you're comparing MMMU-Pro (multimodal) to MMLU-Pro (text only). I don't think they published scores on MMLU-Pro for 3.2.

(Edit: parent comment was corrected, thanks!)

wesleyyue 638 days ago

Yep, you're right, thanks for catching that (sorry for the ninja edit!)

idiliv 638 days ago

Where do you see the MMLU-Pro evaluation for Llama 3.2 90B? On the link I only see Llama 3.2 90B evaluated against multimodal benchmarks.

wesleyyue 638 days ago

Ah, you're right, I totally misread that!
