HN Mirror



	by regularfry 46 days ago
	Delivery of the first phoneme and delivery of the important information don't have to be coupled. Politicians on TV get very good at this particular trick: they have a set of stock phrases that basically fill time while their brain gets in gear. We just need something to fill the gap so our System 1 doesn't lose confidence in the interaction.


HarHarVeryFunny 46 days ago

So you could just locally generate the "You're absolutely right! ..." prefix without even waiting for the response to stream in!


6510 46 days ago

Do speech to text on the client and send the text/subtitles along with the audio.

If the connection is truly bad, upload your voice and quantify emotional payload.


HarHarVeryFunny 46 days ago

I guess different approaches could be applicable for client to server vs server to client.

For client to server you want low latency, don't care about pauses introduced by communications (the model doesn't care), and could certainly tolerate a fallback to lower-bandwidth text only (local STT) or more heavily compressed voice.

For server to client it needs to be high-quality voice without pauses. But as the parent was suggesting, you could potentially hide response latency (whether due to the server or to communication degradation) with a human-like conversational "trick": make some sound before the brain is engaged and generating a response. "That's absolutely right! ..." would be a tad annoying, but "Hmm..." might be OK, especially if it's not done all the time, just as a locally initiated conversational filler when the server is slow to respond.
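
A minimal sketch of that locally initiated filler, purely as an illustration and not anyone's actual implementation. Everything here is an assumption made up for the example: `fetch_response` and `play` are hypothetical async callables standing in for the server call and the audio output, and the 0.7 s threshold and the filler phrases are arbitrary.

```python
import asyncio
import random

# Hypothetical filler phrases; a real client would play short audio clips.
FILLERS = ["Hmm...", "Let me think..."]


async def respond_with_filler(fetch_response, play,
                              filler_after_s=0.7,
                              filler_probability=1.0,
                              rng=random):
    """Wait for the server reply; if it is slow, play a local filler first.

    fetch_response: hypothetical async callable returning the server's reply.
    play:           hypothetical async callable that outputs one utterance.
    """
    task = asyncio.ensure_future(fetch_response())
    # Give the server a short grace period before filling the gap.
    done, _pending = await asyncio.wait({task}, timeout=filler_after_s)
    # Only fill when the reply is late, and optionally not every time.
    if not done and rng.random() < filler_probability:
        await play(rng.choice(FILLERS))
    reply = await task
    await play(reply)
    return reply
```

Setting `filler_probability` below 1.0 covers the "not done all the time" part: the filler only ever plays when the reply is already late, and then only some of the time.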


6510 46 days ago

HarHar, that makes me think of those people who start each sentence with your name.


HarHarVeryFunny 46 days ago

:) I guess that'd work too if they want to go with the Butt-Head persona!


regularfry 46 days ago

I do wonder if you actually need two models here. Audio-to-audio hindbrain on the client, and a beefy text-mode frontal lobe somewhere in the cloud, with the comms between them explicitly trained in as a potentially low-bandwidth steering connection transferring embeddings, not text.


6510 43 days ago

https://en.wikipedia.org/wiki/Sloot_Digital_Coding_System
