Hacker News new | ask | show | jobs
by justlikereddit 412 days ago
>It's not possible to mimic natural conversational rhythm with a separate LLM generating text, a separate text-to-speech generating audio, and a separate VAD determining when to respond and when to interrupt.

If you load the system prompt with enough assumptions that it's a speech-impared subtitle transcription that follows a dialogue you might pull it off, but likely you might need to fine tune your model to play nicely with the TTS and rest of setup