Hacker News
by zhihaojia 1124 days ago
SpecInfer is a system that accelerates generative LLM serving with speculative inference and token tree verification. The key idea is to use an LLM as a token tree verifier rather than as an incremental decoder: instead of generating one token per step, it checks a tree of speculated candidate tokens. We show that this approach reduces LLM inference latency by 2.8x.
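
To make the token tree idea concrete, here is a minimal toy sketch in Python. It is not SpecInfer's implementation: `TreeNode`, `verify_token_tree`, and `toy_llm` are hypothetical names made up for illustration, the "LLM" is a stand-in that deterministically continues a fixed sentence, and only greedy verification is shown. A real system would presumably score all tree nodes in one batched pass rather than calling the model once per level as this loop does.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence


@dataclass
class TreeNode:
    """One speculated token; children are alternative continuations."""
    token: str
    children: List["TreeNode"] = field(default_factory=list)


def verify_token_tree(
    prefix: Sequence[str],
    candidates: List[TreeNode],
    llm_next_token: Callable[[Sequence[str]], str],
) -> List[str]:
    """Greedy verification: follow the tree while it agrees with the LLM.

    Returns the accepted speculated tokens plus one extra token chosen by
    the LLM itself at the point where the tree ran out or disagreed.
    """
    accepted: List[str] = []
    level = candidates
    while True:
        # The token the LLM would pick after the prefix and accepted tokens.
        target = llm_next_token(list(prefix) + accepted)
        accepted.append(target)
        # Look for a speculated branch that made the same choice.
        match = next((node for node in level if node.token == target), None)
        if match is None:
            return accepted  # no branch matched; target is the correction
        level = match.children


# Hypothetical stand-in for an LLM: always continues this fixed sentence.
SENTENCE = ["the", "cat", "sat", "on", "the", "mat"]


def toy_llm(prefix: Sequence[str]) -> str:
    return SENTENCE[len(prefix)]


# A speculated token tree proposed after the prefix ["the"]:
#   dog -> ran
#   cat -> sat -> down
#       -> ran
tree = [
    TreeNode("dog", [TreeNode("ran")]),
    TreeNode("cat", [TreeNode("sat", [TreeNode("down")]), TreeNode("ran")]),
]

result = verify_token_tree(["the"], tree, toy_llm)
print(result)  # ['cat', 'sat', 'on']
```

In this toy run one verification step yields three tokens: two taken from the speculated branch "cat, sat" and one supplied by the verifier where the branch diverged. An incremental decoder would have needed three separate steps for the same output, which is the intuition behind the latency reduction.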