by ericmcer 91 days ago
Seriously, for audio conversations the LLM layer is fairly stable. Getting STT and TTS to be reliable has been a much bigger hurdle.

I hear the same phrases 10+ times a day and they get stressed a bit differently each time; it seems like an exceptionally hard problem. My dream of a super reliable [LLM output stream -> streaming TTS endpoint -> WebRTC audio stream] seems pretty much impossible at this point.

Is the goal to trick people into thinking it is a human, or to create a high-trust robot? I am hoping that as voice agents get more sophisticated, the stigma around "It's making me talk to a robot" lessens, so we don't need to worry so much about convincing someone it is a real person.