by artur44
185 days ago
A simple approach is to split the model’s output stream before it reaches TTS.
Reasoning and structured tokens go into one bucket, and the actual user-facing text goes into another. Only the second bucket is synthesized. Most “thinking out loud” issues come from feeding the whole stream directly into audio.
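
Here is a rough sketch of that split in Python. It assumes the model wraps its reasoning in `<think>`...`</think>` tags, which is purely an illustrative convention; the tag names and the `split_stream` helper are hypothetical, so swap in whatever delimiter your model actually emits. The only fiddly part is that a tag can arrive cut in half across two chunks, so the tail of the buffer is held back whenever it could be the start of a tag.

```python
from typing import Iterable, List, Tuple

# Hypothetical delimiters; replace with whatever your model actually emits.
OPEN_TAG = "<think>"
CLOSE_TAG = "</think>"


def split_stream(chunks: Iterable[str]) -> Tuple[str, str]:
    """Route streamed text into (reasoning, spoken) buckets.

    Only the second element should be handed to the TTS engine.
    """
    reasoning: List[str] = []
    spoken: List[str] = []
    buffer = ""
    in_reasoning = False

    for chunk in chunks:
        buffer += chunk
        while True:
            tag = CLOSE_TAG if in_reasoning else OPEN_TAG
            bucket = reasoning if in_reasoning else spoken
            idx = buffer.find(tag)
            if idx != -1:
                # Full tag found: flush text before it, then switch buckets.
                bucket.append(buffer[:idx])
                buffer = buffer[idx + len(tag):]
                in_reasoning = not in_reasoning
                continue
            # No full tag yet: hold back any tail that could be a tag prefix.
            keep = 0
            for n in range(min(len(tag) - 1, len(buffer)), 0, -1):
                if buffer.endswith(tag[:n]):
                    keep = n
                    break
            cut = len(buffer) - keep
            bucket.append(buffer[:cut])
            buffer = buffer[cut:]
            break

    # End of stream: whatever is left belongs to the current bucket.
    (reasoning if in_reasoning else spoken).append(buffer)
    return "".join(reasoning), "".join(spoken)


if __name__ == "__main__":
    stream = ["<thi", "nk>user wants the time</th", "ink>It is ", "3 pm."]
    thoughts, speech = split_stream(stream)
    print(repr(speech))  # only this string would go to TTS
```

In a real voice pipeline you would probably yield the spoken pieces as they arrive rather than joining them at the end, so synthesis can start before the model finishes. The routing logic stays the same.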