>...this seems to be working well for them.

Is this because the users are streaming audio in a more conversational style?

For example, when you give Siri a command, you state it and then stop speaking.

For most of ChatGPT’s life, in OpenAI’s iOS app, if you wanted to dictate input text, you would tap the record button to start and tap it again to stop, using either the app’s own speech-to-text capability or Siri’s speech-to-text in the input field.

Conversational speech-to-text is more ongoing, though, which would make a 10-second cold start OK: you don’t sense as much lag because you’re still speaking while it happens.

Or perhaps people generally record input longer than 10 seconds, and you are sending the first chunk as soon as possible to get Whisper going.

Then follow-up chunks are handled as warm boots, and the text is reassembled afterwards? Is that roughly correct?
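To make the flow I'm guessing at concrete, here is a rough Python sketch. Everything in it is an assumption on my part: `FakeTranscriber` is a made-up stand-in for a Whisper worker, and the 10-second chunk length, 10-second cold start, and 1-second warm latency are placeholder numbers, not anything I know about the real system.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    index: int      # position of this chunk in the original recording
    samples: list   # raw audio samples for this chunk


def split_into_chunks(samples, sample_rate, chunk_seconds):
    """Cut a recording into fixed-length chunks; the last one may be shorter."""
    size = int(sample_rate * chunk_seconds)
    return [
        Chunk(start // size, samples[start:start + size])
        for start in range(0, len(samples), size)
    ]


class FakeTranscriber:
    """Hypothetical worker: the first call pays a cold start, later calls are warm."""

    def __init__(self, cold_start_s=10.0, per_chunk_s=1.0):
        self.cold_start_s = cold_start_s  # assumed cold boot cost
        self.per_chunk_s = per_chunk_s    # assumed warm inference cost
        self.warm = False

    def transcribe(self, chunk):
        latency = self.per_chunk_s + (0.0 if self.warm else self.cold_start_s)
        self.warm = True
        # Placeholder text; a real worker would return the recognized words.
        return chunk.index, f"<text {chunk.index}>", latency


def reassemble(results):
    """Join chunk transcripts in recording order, even if they arrived out of order."""
    return " ".join(text for _, text, _ in sorted(results))


# 25 seconds of silence at 16 kHz, split into 10-second chunks.
chunks = split_into_chunks([0] * (16000 * 25), 16000, 10)
worker = FakeTranscriber()
results = [worker.transcribe(c) for c in chunks]
latencies = [latency for _, _, latency in results]
transcript = reassemble(reversed(results))
```

In this toy version only the first chunk pays the cold start, and it would be paid while the user is still talking, which is the effect I'm asking about.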

Anything you can share about the request and data flow that makes a longer cold boot time workable, for a single recording versus streaming, and about how the audio is broken up, would be helpful.