by AaronFriel 706 days ago
In autoregressive models we can "feed forward" the model by injecting additional tokens: compute the KV cache entries for those tokens (called "prefill"), then resume decoding. If we can do this quickly, and on the same node that has a hot KV cache (or otherwise low-latency access to a shared KV cache), we are quite a ways closer to offering a full duplex, or at least near-zero-latency, language model API. This does require a full duplex connection (e.g., a WebSocket).
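
To make the inject-then-resume idea concrete, here is a toy sketch in Python. Everything in it is hypothetical: `ToySession`, `prefill`, `decode_step`, and `duplex_loop` are illustrative names, the "KV cache" is just a list with one entry per token, and the "model" emits a placeholder token derived from the cache length. A real server would run a batched forward pass in `prefill` and sample from logits in `decode_step`.

```python
class ToySession:
    """Toy stand-in for one decoding session with a hot KV cache."""

    def __init__(self):
        # One entry per token already processed (a real cache holds K/V tensors).
        self.kv_cache = []

    def prefill(self, tokens):
        """Add cache entries for injected tokens without emitting any output."""
        self.kv_cache.extend(tokens)

    def decode_step(self):
        """Emit one placeholder token conditioned on the cache, then cache it."""
        token = f"<out@{len(self.kv_cache)}>"
        self.kv_cache.append(token)
        return token


def duplex_loop(session, inbound, steps):
    """Interleave decoding with injected input.

    `inbound` maps a step index to the tokens that arrived before that step,
    standing in for messages read off a full duplex connection.
    """
    emitted = []
    for step in range(steps):
        injected = inbound.get(step)
        if injected:
            session.prefill(injected)  # extend the cache, then keep decoding
        emitted.append(session.decode_step())
    return emitted


session = ToySession()
session.prefill(["Hi", "there"])  # initial prompt
output = duplex_loop(session, {2: ["wait", "stop"]}, steps=4)
print(output)
```

The point of the sketch is only that injected tokens extend the existing cache instead of forcing a restart; the latency question is how cheaply `prefill` can run on the node that already holds that cache.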

For true full duplex communication, including interruption, it will be more challenging but should be possible with current model architectures. The model may need to be able to emit no-op or "pause" tokens, or itself serve as the voice activity detector (VAD), and the positional encoding of tokens might need to be replaced or augmented with time and participant information.
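
One way to picture the time-and-participant idea, purely as an assumption on my part: evaluate the usual sinusoidal encoding at a wall-clock timestamp instead of a token index, and append a one-hot participant vector. The names `PAUSE`, `time_encoding`, `encode_event`, and `next_token` below are made up for illustration, and the hard-coded pause rule is a placeholder for behavior a real model would have to learn.

```python
import math

PAUSE = "<pause>"  # hypothetical no-op token


def time_encoding(t_seconds, dim=8):
    """Sinusoidal encoding evaluated at a time in seconds, not a token index."""
    enc = []
    for i in range(dim // 2):
        freq = 1.0 / (10000 ** (2 * i / dim))
        enc.append(math.sin(t_seconds * freq))
        enc.append(math.cos(t_seconds * freq))
    return enc


def encode_event(t_seconds, participant, num_participants=2, dim=8):
    """Time encoding followed by a one-hot vector marking who produced the token."""
    one_hot = [1.0 if p == participant else 0.0 for p in range(num_participants)]
    return time_encoding(t_seconds, dim) + one_hot


def next_token(candidate, other_is_speaking):
    """Toy policy: yield the floor with a pause token while the other side talks."""
    return PAUSE if other_is_speaking else candidate


features = encode_event(1.25, participant=1)
print(len(features), next_token("hello", other_is_speaking=True))
```

With timestamps in the encoding, a run of pause tokens carries real elapsed time, which is what would let the model represent silence at all.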

I imagine the first language model which has "awkward pauses" is only a year or so away.