| HN Mirror


by ThomasBb 79 days ago

Beyond the models getting better, there are still huge gains available on the inference engine side with new tricks like DFlash, MRT, and TurboQuant; for some use cases these can multiply speeds. There are even some model-specific optimized kernels, like the ones for DeepSeek 4 flash, that seem wild.

Makes me feel we are nowhere near the optimum yet.

Examples: https://dasroot.net/posts/2026/05/gemma-4-speed-hacks-mtp-df...

https://x.com/bindureddy/status/2052982206344409242?s=46

2 comments

brrrrrm 79 days ago

What's MRT?


ThomasBb 79 days ago

Sorry, autocorrect got me there: I meant MTP (multi-token prediction).
