|
|
|
|
|
by emreckartal
616 days ago
|
|
Ah, I see now. To clarify, while you can enable transcription to see what Ichigo says, Ichigo's design skips directly from audio to speech representations without creating a text transcription of the user’s input. This makes interactions faster but does mean that the user's spoken input isn't transcribed to text. The flow we use is Speech → Encoder → Speech Representations → LLM → Text → TTS. By skipping the text step, we're able to speed things up and focus on the verbal experience. Hope this clears things up! |
|