by drag0s 360 days ago
One example where non-thinking matters is latency-sensitive workflows such as voice AI.

Correct, though pretty much anything end-user-facing is latency-sensitive, and voice is a tiny percentage of that. No one likes waiting, and the involvement of an LLM doesn't change this from a user's PoV.
I wonder if you can hide the latency, especially for voice?

What I have in mind is to start the voice response with a non-thinking model that produces, say, a sentence or two in a fraction of a second. That will take the voice model a few seconds to read out. In that time, you use a thinking model to start working on the next part of the response.

In a sense, very similar to how everyone knows to stall in an interview by starting with 'this is a very good question...', and using that time to think some more.
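
To make the idea concrete, here is a rough sketch in Python with asyncio. Everything in it is an assumption for illustration: `fast_model`, `thinking_model`, and `speak` are hypothetical stand-ins that just sleep, not calls to any real LLM or TTS API, and the timings are made up. The only point is the overlap: the thinking model runs while the opener is being read out.

```python
import asyncio
import time


async def fast_model(prompt: str) -> str:
    """Hypothetical non-thinking model: answers in a fraction of a second."""
    await asyncio.sleep(0.05)  # made-up latency
    return "That's a good question."


async def thinking_model(prompt: str, opener: str) -> str:
    """Hypothetical thinking model: slower, continues after the opener."""
    await asyncio.sleep(0.3)  # made-up latency
    return "Here is the considered part of the answer."


async def speak(text: str, seconds: float) -> None:
    """Hypothetical TTS playback: reading text aloud takes wall-clock time."""
    await asyncio.sleep(seconds)


async def respond(prompt: str, opener_speech_seconds: float = 0.4) -> dict:
    start = time.monotonic()

    # 1. Get a quick opener from the non-thinking model.
    opener = await fast_model(prompt)
    first_audio_at = time.monotonic() - start

    # 2. Start the thinking model in the background, telling it what
    #    has already been said so it can continue from there.
    thinking_task = asyncio.create_task(thinking_model(prompt, opener))

    # 3. Read the opener out loud; the thinking model runs meanwhile.
    await speak(opener, opener_speech_seconds)
    speech_done = time.monotonic()

    # 4. Pick up the thinking model's result. If it finished during the
    #    speech, the user hears no extra pause.
    answer = await thinking_task
    gap = time.monotonic() - speech_done

    return {
        "opener": opener,
        "answer": answer,
        "first_audio_at": first_audio_at,
        "gap_after_opener": gap,
    }


if __name__ == "__main__":
    print(asyncio.run(respond("Why is the sky blue?")))
```

With these made-up numbers the thinking model finishes while the opener is still being spoken, so the pause after the opener is close to zero. If the thinking model took longer than the speech, the remainder would show up as `gap_after_opener`, which is where you would need a longer opener or more filler.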