Hacker News
by
whimsicalism
186 days ago
Makes sense, I think streaming audio->audio inference is a relatively big lift.
red2awn
185 days ago
Correct, it breaks the single-prompt, single-completion assumption baked into the frameworks. Conceptually it's still prompt/completion, but for a low-latency response you have to do streaming KV cache prefill with a websocket server.
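A toy sketch of the streaming-prefill idea, not any framework's actual API. `ToySession`, `prefill`, `decode`, and `serve` are hypothetical names. An `asyncio.Queue` stands in for the websocket, and the "KV cache" is just a list with one entry per token. The point is only that each incoming chunk extends the cache while the user is still talking, so nothing is left to prefill when the turn ends.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class ToySession:
    """Hypothetical stand-in for a model session with a growing KV cache."""
    kv_cache: list = field(default_factory=list)

    def prefill(self, chunk):
        # Extend the cache with the new chunk only; earlier tokens are not recomputed.
        self.kv_cache.extend(chunk)

    def decode(self, max_tokens):
        # Toy "generation": echo the last cached tokens in reverse order.
        return list(reversed(self.kv_cache[-max_tokens:]))


async def serve(incoming, session, max_tokens=3):
    # Prefill as chunks arrive; None marks the end of the user's turn.
    while (chunk := await incoming.get()) is not None:
        session.prefill(chunk)
    # The whole prompt is already cached, so decoding can start immediately.
    return session.decode(max_tokens)


async def demo():
    queue = asyncio.Queue()  # assumption: stands in for the websocket stream
    for chunk in ([1, 2], [3, 4], [5], None):
        queue.put_nowait(chunk)
    session = ToySession()
    reply = await serve(queue, session)
    return reply, len(session.kv_cache)


reply, cached = asyncio.run(demo())
```

In a real server the chunks would be audio frames arriving over the socket, and `prefill` would run a forward pass that appends keys and values for just those frames.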
whimsicalism
185 days ago
I imagine you have to start decoding many speculative completions in parallel to have true low latency.
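One possible reading of this, as a toy sketch: each time a chunk arrives, guess that the turn might end there and start a completion for that prefix right away, leaving the earlier guesses running. When the turn really ends, keep the guess that matches what was actually heard and cancel the rest. `complete` and `speculate` are hypothetical names, and the "model call" is a placeholder coroutine.

```python
import asyncio


async def complete(prefix):
    # Hypothetical stand-in for a model call; the "reply" only reports the prefix length.
    await asyncio.sleep(0)
    return f"reply-to-{len(prefix)}-tokens"


async def speculate(chunks):
    heard, guesses = [], {}
    for chunk in chunks:
        heard.extend(chunk)
        # Guess that the turn ends here and start decoding in the background.
        guesses[len(heard)] = asyncio.create_task(complete(list(heard)))
        await asyncio.sleep(0)  # yield so the background tasks can make progress
    # The turn is over: keep the guess for the full input, drop the stale ones.
    winner = guesses.pop(len(heard))
    for stale in guesses.values():
        stale.cancel()
    return await winner


result = asyncio.run(speculate([[1, 2], [3], [4, 5]]))
```

The trade-off is wasted compute: every guess except the last is thrown away, in exchange for a reply that is already in flight when the user stops.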