Hacker News
by raybb 228 days ago
"once the user stops talking" is a key insight here for me. When using this I wasn't intentionally pausing to let it figure out an answer. It seemed to just pop up while I was talking. But upon experimenting some more, it does seem to wait until there's a bit of a pause most of the time.

However, it's still wild to me how fast and responsive it is. I can talk for 10 seconds and then in ~500ms I see the updates. Perhaps it doesn't even transcribe and instead feeds the audio to a multimodal LLM along with whatever tasks it already knows about? Or maybe it's transcribing live as you talk, and when you stop it sends the transcript to the LLM.
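
To make that second guess concrete, here's a rough sketch of "transcribe live, flush on pause". It is not how the app actually works (I don't know that). It assumes a hypothetical stream of (timestamp_ms, word) pairs from some live transcriber, and the 500 ms gap is just a placeholder I picked, not a known setting.

```python
from typing import Iterable, Iterator, List, Optional, Tuple

PAUSE_MS = 500  # assumed gap that counts as "the user stopped talking"


def utterances(
    words: Iterable[Tuple[int, str]], pause_ms: int = PAUSE_MS
) -> Iterator[str]:
    """Group (timestamp_ms, word) pairs into utterances, split at long gaps."""
    buffer: List[str] = []
    last_ts: Optional[int] = None
    for ts, word in words:
        # A gap at least pause_ms long closes the current utterance.
        if last_ts is not None and ts - last_ts >= pause_ms and buffer:
            yield " ".join(buffer)  # a real app would send this to the LLM here
            buffer = []
        buffer.append(word)
        last_ts = ts
    if buffer:
        yield " ".join(buffer)  # flush whatever is left when the stream ends


# Hypothetical transcript: a 1100 ms gap separates the two requests.
stream = [(0, "buy"), (200, "milk"), (400, "tomorrow"), (1500, "call"), (1700, "mom")]
print(list(utterances(stream)))
```

If it works anything like this, the transcript is already sitting in the buffer when you stop, so the only latency you'd notice is the pause threshold plus the LLM call.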

Anyone have a sense of what model they might be using?

1 comments

I cannot remember the exact number off the top of my head, and am clearly too lazy to google it, but there is a specific length of time in which, if no new noises pass through, the human brain processes it as a pause/silence.

I want to say 300ms, which would coincide with your 500ms example.
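
For what it's worth, here's roughly how you'd detect that kind of pause in audio with a simple energy check. This is only a sketch: the 300 ms value is the unverified figure from above, and the 20 ms frame size, the 0.01 RMS threshold, and the function names are all assumptions for illustration. Real voice activity detectors are more sophisticated than this.

```python
import math
from typing import Optional, Sequence


def rms(frame: Sequence[float]) -> float:
    """Root-mean-square level of one frame of samples in [-1.0, 1.0]."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def end_of_speech_frame(
    frames: Sequence[Sequence[float]],
    frame_ms: int = 20,       # assumed frame length
    silence_ms: int = 300,    # unverified figure from the comment above
    threshold: float = 0.01,  # assumed level below which a frame is "quiet"
) -> Optional[int]:
    """Index of the frame where trailing silence first reaches silence_ms.

    Silence before any speech is ignored. Returns None if no pause is found.
    """
    needed = math.ceil(silence_ms / frame_ms)
    quiet_run = 0
    heard_speech = False
    for i, frame in enumerate(frames):
        if rms(frame) >= threshold:
            heard_speech = True
            quiet_run = 0  # any sound resets the silence counter
        elif heard_speech:
            quiet_run += 1
            if quiet_run >= needed:
                return i
    return None


# Synthetic input: 200 ms of "speech" followed by 400 ms of silence.
loud = [0.5] * 160
quiet = [0.0] * 160
print(end_of_speech_frame([loud] * 10 + [quiet] * 20))
```

A shorter `silence_ms` feels snappier but cuts people off mid-thought, while a longer one is safer but adds that much delay to every response.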

This is definitely dependent on individuals. It’s a reason why, during some conversations, people can never seem to get a word in edgewise, even if the person speaking thinks they’re providing opportunities to do so. A mismatch in “pause length” can make for frustrating communication.

I am also too lazy to google or AI it but it’s something I remember from when I taught ESL long ago.

That makes sense! To be honest, I’m referring to my audio engineering degree, and the pause was specific to noticing silence in audio, so I’d 100% agree that in conversation it can vary between people, as I know many people who will not let you get a word in.