| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by artur44 185 days ago
	A simple way is to split the model’s output stream before TTS. Reasoning/structured tokens go into one bucket, actual user-facing text into another. Only the second bucket is synthesized. Most thinking out loud issues come from feeding the whole stream directly into audio.

1 comments

pugio 185 days ago

There is no TTS here. It's a native audio output model which outputs audio tokens directly. (At least, that's how the other real-time models work. Maybe I've misunderstood the Qwen-Omni architecture.)

link

artur44 185 days ago

True, but even with native audio-token models you still need to split the model’s output channels. Reasoning/internal tokens shouldn't go into the audio stream only user-facing content should be emitted as audio. The principle is the same, whether the last step is TTS or audio token generation.

link

regularfry 184 days ago

There's an assumption there that the audio stream contains an equivalent of the <think>/</think> tokens. Every reason to think it should, but without seeing the tokeniser config it's a bit of a guess.

link