Llama.cpp speculative sampling: 2x faster inference for large models


	Llama.cpp speculative sampling: 2x faster inference for large models (github.com)
	4 points by bobivl 1019 days ago

1 comment

Small caveat: this only holds when generating text with simple, predictable structure, such as code. For human-written prose it doesn't help nearly as much.
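The caveat comes from how speculative sampling pays off. A small draft model proposes several tokens, and the large model verifies them in one pass, keeping only the ones it agrees with. The speedup therefore depends on how often drafts are accepted. Below is a rough sketch of the simplified analysis from Leviathan et al. (2023). It assumes each drafted token is accepted independently with probability `alpha`, that `gamma` tokens are drafted per step, and that `c` is the cost of one draft pass relative to one target pass. The `alpha` and `c` values are made-up illustrations, not measurements from llama.cpp.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target-model verification pass.

    Assumes (simplification) each drafted token is accepted independently
    with probability alpha, and gamma tokens are drafted per step.
    """
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must be in [0, 1)")
    # Geometric series: 1 + alpha + alpha^2 + ... + alpha^gamma
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)


def expected_speedup(alpha: float, gamma: int, c: float) -> float:
    """Estimated walltime speedup over plain decoding.

    c is the (assumed) cost of one draft forward pass divided by the cost
    of one target forward pass. Each step costs gamma draft passes plus one
    target pass.
    """
    return expected_tokens_per_step(alpha, gamma) / (gamma * c + 1)


if __name__ == "__main__":
    # Hypothetical acceptance rates: predictable text (code) vs. free prose.
    for label, alpha in [("code-like", 0.8), ("prose-like", 0.5)]:
        s = expected_speedup(alpha, gamma=4, c=0.05)
        print(f"{label}: alpha={alpha}, estimated speedup ~ {s:.2f}x")
```

Under these assumed numbers, the code-like case comes out around 2.8x while the prose-like case drops to about 1.6x. This matches the point above: the same draft/target setup gains much less when the draft model guesses the next tokens less reliably.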