


	by whimsicalism 186 days ago
	Makes sense, I think streaming audio->audio inference is a relatively big lift.


red2awn 185 days ago

Correct, it breaks the single-prompt, single-completion assumption baked into the frameworks. Conceptually it's still prompt/completion, but for a low-latency response you have to do streaming KV cache prefill with a websocket server.
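
To make the idea concrete, here is a rough sketch of streaming prefill, assuming input arrives in chunks and the cache can be extended incrementally. `ToyKVCache`, `stream_prefill`, and the `None` end-of-turn marker are hypothetical names for illustration, and an `asyncio.Queue` stands in for the websocket so the example needs only the standard library. No real model or framework API is shown.

```python
import asyncio


class ToyKVCache:
    """Stand-in for a model's KV cache: one entry per processed token."""

    def __init__(self):
        self.entries = []

    def prefill(self, tokens):
        # A real model would run a forward pass here and store keys/values.
        self.entries.extend(tokens)


async def stream_prefill(chunks, cache):
    """Extend the cache as each chunk arrives instead of waiting for the full prompt."""
    while True:
        chunk = await chunks.get()
        if chunk is None:  # hypothetical end-of-turn marker
            return cache.entries
        cache.prefill(chunk)


async def demo():
    chunks = asyncio.Queue()  # stands in for messages read from a websocket
    for chunk in ([1, 2, 3], [4, 5], [6]):
        chunks.put_nowait(chunk)
    chunks.put_nowait(None)
    return await stream_prefill(chunks, ToyKVCache())


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The point of the pattern is that by the time the turn ends, everything except the last chunk has already been prefilled, so decoding can start almost immediately.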


whimsicalism 185 days ago

I imagine you have to start decoding many speculative completions in parallel to have true low latency.
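
One possible reading of this, as a toy sketch: start a speculative decode at every chunk boundary, on the guess that the turn ends there, then keep only the one that matches the actual end of turn. `decode` and `respond` are hypothetical stand-ins, the sleep simulates decoding time, and the input is assumed to be non-empty. This is not taken from any particular framework.

```python
import asyncio


async def decode(context):
    """Stand-in for autoregressive decoding over the input received so far."""
    await asyncio.sleep(0.01)  # pretend decoding takes time
    return f"reply to {len(context)} chunks"


async def respond(chunks):
    """Speculatively decode at every chunk boundary; keep the final one."""
    context, tasks = [], {}
    for chunk in chunks:
        context.append(chunk)
        # Guess that the turn ends here and start decoding right away.
        tasks[len(context)] = asyncio.create_task(decode(tuple(context)))
        await asyncio.sleep(0)  # yield so the new task can start running
    # The turn has ended: only the speculation made at the last boundary is valid.
    winner = tasks.pop(len(context))
    for stale in tasks.values():
        stale.cancel()
    return await winner


if __name__ == "__main__":
    print(asyncio.run(respond(["a", "b", "c"])))
```

The trade-off is wasted compute on the discarded speculations in exchange for having the winning completion already in flight when the speaker stops.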
