by RandyOrion 5 days ago
Thanks to the Gemma team for this release.

Compared to autoregressive decoding, diffusion is a big deal for local MoE inference because it improves token generation efficiency, especially in the common GPU + RAM offload setup.

However, some models, namely dense ones, sit at a better point on the performance vs. memory Pareto front, so I'll just wait.

P.S. QAT is really something, as it reduces performance fluctuations compared to the regular non-QAT version. Thanks again.