Hacker News
by endymi0n 44 days ago
I don’t know exactly where MTP fits within the inference stack, but does anyone know whether it’s possible to implement it for the MLX universe?
1 comment

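MTP (multi-token prediction) gives the model extra lightweight prediction heads that guess several upcoming tokens at once. Those guesses play the role of a smaller draft model in speculative decoding: they are handed to the larger model for verification. If the draft tokens match what the larger model would have produced, it accepts them instead of generating each one in its own forward pass, which is much cheaper. From what I read, this is not unique to the GGUF or MLX format. Instead, the model has to be trained to support that feature.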
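To make the draft-and-verify idea concrete, here is a minimal sketch of greedy speculative decoding in plain Python. Everything in it is an assumption for illustration: `target_model` and `draft_model` are toy stand-ins (not MLX or llama.cpp APIs), `speculative_step` and `generate` are hypothetical names, and the per-token verification calls stand in for what a real runtime would do in one batched forward pass.

```python
from typing import Callable, List

# A "model" here is just a function from a token context to the next token.
NextToken = Callable[[List[int]], int]


def target_model(ctx: List[int]) -> int:
    """Toy stand-in for the large model (hypothetical, deterministic)."""
    return (ctx[-1] * 3 + 1) % 17


def draft_model(ctx: List[int]) -> int:
    """Toy stand-in for the cheap draft: agrees with the target most of the time."""
    guess = (ctx[-1] * 3 + 1) % 17
    # Deliberately wrong whenever the last token is a multiple of 5.
    return guess if ctx[-1] % 5 else (guess + 1) % 17


def speculative_step(target: NextToken, draft: NextToken,
                     context: List[int], k: int) -> List[int]:
    """Draft k tokens, then keep the longest prefix the target agrees with."""
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(context + proposed))

    accepted: List[int] = []
    for tok in proposed:
        # A real runtime scores all k positions in a single forward pass.
        expected = target(context + accepted)
        if tok != expected:
            accepted.append(expected)  # first mismatch: take the target's token
            return accepted
        accepted.append(tok)
    # Every draft token was accepted, so the target adds one more token.
    accepted.append(target(context + accepted))
    return accepted


def generate(target: NextToken, draft: NextToken,
             prompt: List[int], n_new: int, k: int = 4) -> List[int]:
    """Generate n_new tokens using repeated draft-and-verify steps."""
    out = list(prompt)
    while len(out) - len(prompt) < n_new:
        out.extend(speculative_step(target, draft, out, k))
    return out[:len(prompt) + n_new]


def greedy(target: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Plain one-token-at-a-time decoding with the target only, for comparison."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target(out))
    return out
```

With greedy verification like this, the output is the same as decoding with the large model alone. The draft only changes how many tokens each verification step yields (between 1 and `k + 1`), which is why the quality of the draft, or of the trained MTP heads, matters for speed rather than for the result.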