|
|
|
|
|
by justlikereddit
412 days ago
|
|
>It's not possible to mimic natural conversational rhythm with a separate LLM generating text, a separate text-to-speech generating audio, and a separate VAD determining when to respond and when to interrupt. If you load the system prompt with enough assumptions that it's a speech-impared subtitle transcription that follows a dialogue you might pull it off, but likely you might need to fine tune your model to play nicely with the TTS and rest of setup |
|