Hacker News new | ask | show | jobs
by emreckartal 616 days ago
Ah, I see now.

To clarify, while you can enable transcription to see what Ichigo says, Ichigo's design skips directly from audio to speech representations without creating a text transcription of the user’s input. This makes interactions faster but does mean that the user's spoken input isn't transcribed to text.

The flow we use is Speech → Encoder → Speech Representations → LLM → Text → TTS. By skipping the text step, we're able to speed things up and focus on the verbal experience.

Hope this clears things up!

1 comments

I understand that. The problem is that in many scenarios users would want to see transcripts of what they said alongside the model output. Like if I have a chat with a model about choosing a place to move to, I would probably also want to review it later. And when I review it, I will see: me: /audio record/ AI: 200-300m. No easy way to see at glance what the AI answer was about.
You can just run whisper on the conversations as a background job populating the text versions of all the user inputs, so it doesn't interfere with the real-time latency.
It's not going to match what model hears.