by npn 7 days ago
How?

edit: now that I've read the article fully, it seems like they use some very effective MTP algorithm, and somehow the quality is still decent enough.

though I doubt the quality really only drops a bit like they claim. maybe on the benchmarks, but in general use, heavily quantized models very often show worse results.

They say they are using https://github.com/tile-ai/TileRT

- persistent CUDA kernel

- tiled processing with overlapping read/writes

- model designed with specific constraints in mind

Excuse me, do aliens live among us? 17 commits, 99% Python and multiplying the speed of GLM, Deepseek V4, MiMO 2.5?
TileRT is closed source; the repo is just a Python wrapper that invokes the binary.
I wonder if it would be possible to hardcode a model with some kind of MTP-adjacent algorithm so that it uses a smaller portion of itself to generate most of the tokens but routes to the real experts every once in a while to steer it towards good thinking directions. (Perhaps this is done only while it's generating its thinking block, and the training takes it into account.)

Could result in very high efficiency and still good intelligence without having to resort to fundamental adjustments like going to a diffusion LLM

I doubt you can do that. The MTP magic happens because text contains a lot of low-value, nearly fixed tokens that almost always get generated in sequence (punctuation, function words, language keywords, etc.). for the most important ones (the entities, the content words, variables) you still need the full model.

so there is always a maximum limit for how well MTP can do.
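
To put a rough number on that ceiling, here is a back-of-the-envelope sketch. It assumes a draft-and-verify setup where the MTP heads propose k tokens per step and each one is accepted independently with probability p. That independence is a simplification, not something the article or TileRT states. The function names are made up for illustration.

```python
def expected_tokens_per_pass(p: float, k: int) -> float:
    """Expected tokens emitted per full-model pass.

    Assumption (hypothetical model): k drafted tokens, each accepted
    independently with probability p; drafting stops at the first
    rejection, and the full model always contributes one token itself.
    That gives the geometric sum 1 + p + p^2 + ... + p^k.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if p == 1.0:
        return float(k + 1)
    return (1.0 - p ** (k + 1)) / (1.0 - p)


def ceiling(p: float) -> float:
    """Limit of expected_tokens_per_pass as k grows without bound."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    return 1.0 / (1.0 - p)


if __name__ == "__main__":
    for p in (0.5, 0.7, 0.9):
        row = ", ".join(f"k={k}: {expected_tokens_per_pass(p, k):.2f}" for k in (1, 2, 4, 8))
        print(f"p={p}: {row}, ceiling: {ceiling(p):.2f}")
```

Under these assumptions, adding more draft tokens gives diminishing returns: the tokens per pass can never exceed 1 / (1 - p), so the hard-to-predict content words (low p) set the limit no matter how many heads you add.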