Hacker News
by GaggiX 660 days ago
This is just STT+LLM+TTS. The GPT-4o voice mode that is being released uses a single model to listen and generate audio tokens, which allows a much better understanding of the environment (like understanding two people talking at the same time) and much more powerful speech generation (like singing).
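
To make the difference concrete, here is a rough toy sketch of the cascaded STT+LLM+TTS design. Everything in it (`AudioFrame`, `stt`, `llm`, `tts`, the `pitch_hz` field) is a hypothetical stand-in I made up for illustration, not any real product's API or implementation. The point is only that the stages hand each other a plain string, so anything that isn't text (who is speaking, pitch) never reaches the language model.

```python
from dataclasses import dataclass


@dataclass
class AudioFrame:
    """Toy stand-in for a chunk of audio: words plus non-text cues (hypothetical)."""
    speaker: str
    words: str
    pitch_hz: float  # made-up prosody feature, for illustration only


def stt(frames):
    """Toy speech-to-text: keeps only the words, drops speaker and pitch."""
    return " ".join(f.words for f in frames)


def llm(text):
    """Toy text-only language model: builds a reply from the text alone."""
    return f"You said: {text}"


def tts(text):
    """Toy text-to-speech: renders every word at one flat pitch."""
    return [AudioFrame(speaker="assistant", words=w, pitch_hz=120.0)
            for w in text.split()]


def cascaded_pipeline(frames):
    """STT+LLM+TTS: the only thing passed between stages is a string."""
    return tts(llm(stt(frames)))


# Two speakers with different pitch, as in the "two people talking" case.
frames = [
    AudioFrame("alice", "hello", 220.0),
    AudioFrame("bob", "hi", 110.0),
]

text_seen_by_llm = stt(frames)      # speaker identity and pitch are gone here
reply = cascaded_pipeline(frames)   # output pitch is fixed by the toy tts
```

A single model that consumes and emits audio tokens has no such text bottleneck in the middle, which is why it can in principle pick up on overlapping speakers or produce something like singing.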