by regularfry 10 days ago
This is a different model with, confusingly, approximately the same number of params as the existing gemma4 MoE. Unclear from a quick scan whether one was trained somehow from the other.

The mechanism isn't the same as speculative decoding. Speculative decoding happens sequentially and (usually) a couple of tokens at a time; diffusion doesn't, and does blocks of text at once. I haven't read the collateral yet but my assumption would be that it's trained to keep the specific experts stable across a diffusion block.
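To make the contrast concrete, here's a toy sketch of the two loop shapes. It is not how gemma4 or any real model works: `TARGET`, `target_next`, `draft_next`, and both decode functions are made-up names for illustration, the "models" cheat by reading a fixed string, and the diffusion half commits masked positions left to right where a real system would more likely pick by confidence. The speculative half also leaves out the bonus token a real verifier can emit after a fully accepted draft.

```python
from typing import List, Tuple

MASK = "_"
# Hypothetical "ideal" output that both toy models are aiming for.
TARGET = list("the quick brown fox")


def target_next(prefix: List[str]) -> str:
    """Toy 'large' model: always emits the next character of TARGET."""
    return TARGET[len(prefix)]


def draft_next(prefix: List[str]) -> str:
    """Toy 'small' draft model: wrong at every position where i % 5 == 4."""
    i = len(prefix)
    return "?" if i % 5 == 4 else TARGET[i]


def speculative_decode(k: int = 3) -> Tuple[List[str], int]:
    """Sequential: draft a few tokens, verify them, repeat."""
    out: List[str] = []
    rounds = 0
    while len(out) < len(TARGET):
        rounds += 1
        # 1. The draft model proposes up to k tokens, one after another.
        proposal: List[str] = []
        while len(proposal) < k and len(out) + len(proposal) < len(TARGET):
            proposal.append(draft_next(out + proposal))
        # 2. The target model checks them; keep the longest agreeing prefix.
        for tok in proposal:
            expected = target_next(out)
            if tok == expected:
                out.append(tok)
            else:
                out.append(expected)  # target's correction ends this round
                break
    return out, rounds


def block_diffusion_decode(
    block_size: int = 8, steps_per_block: int = 2
) -> Tuple[List[str], int]:
    """Block-wise: start from an all-masked block and fill it in over a few passes."""
    out: List[str] = []
    passes = 0
    for start in range(0, len(TARGET), block_size):
        block_len = min(block_size, len(TARGET) - start)
        block = [MASK] * block_len
        per_step = -(-block_len // steps_per_block)  # ceiling division
        while MASK in block:
            passes += 1
            # One pass predicts every position of the block at once...
            prediction = TARGET[start:start + block_len]
            # ...but only some positions are committed per pass
            # (left to right here, purely as a simplification).
            masked = [i for i, t in enumerate(block) if t == MASK]
            for i in masked[:per_step]:
                block[i] = prediction[i]
        out.extend(block)
    return out, passes


if __name__ == "__main__":
    spec_out, spec_rounds = speculative_decode()
    diff_out, diff_passes = block_diffusion_decode()
    print("".join(spec_out), spec_rounds)
    print("".join(diff_out), diff_passes)
```

Both functions end up with the same string; the point is only the shape of the loop. The speculative one advances a few tokens at a time along the sequence and is gated by where the draft first disagrees, while the diffusion one works on a whole block per pass and refines it in place.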

1 comment

Thanks. I found this other comment that links to a very thorough explanation: https://news.ycombinator.com/item?id=48479042
Oh, fascinating. So they did reuse the existing gemma4 MoE.