| I hate to say it, but while this does seem very impressive and a step forward in how we interact with AI, the use-cases they present and the UX both seem unrealistic and/or unhelpful. With the exception of the real-time translation (which seems like it should be a separate product all by itself), none of the use-cases they presented had much utility. I don't want anything to count the number of animals in my stories or time a trivia quiz for me. The auto-slouch-detector, while the demo was pretty funny, just seems so dystopian and weird. AI interrupting you to scold you about taking elderly parents mountain biking, instead of waiting for you to finish before scolding you? No thanks. The UX is also an issue - the model interrupting the user (even when apparently required by these strange use-cases) is jarring and makes one lose their flow. You can even see this in the demo videos that they put out - the employees/actors had to really concentrate to continue speaking as if they weren't being interrupted by a brash robotic machine. A human, when participating in this (rare) "invited interruption", has the ability to speak "under" the main speaker, and I feel it's generally timed with a lot of nuance. Even in the auto-translation demo, they ducked the human's audio but the AI still steamrolled him, and it would have been impossible to actually do that demo without either an incredible amount of control over one's speaking or (more likely) muting the output. A human translator has a way of "pointing" the "output" to the intended speaker. The very best part of this tech was presented in the first video, where it shows the AI not needlessly interrupting the user. This seems to me more like an important fix for a bug that the current models still (somehow) have. Maybe a good use-case for this would be counting "um's" and the like while practising public speaking. |
- Voice assistants
- Customer experience
- Gaming
- Meeting assistants
- Real-time coach or user assistant for using software
- Translation
- Real-time work on a computer controlled by voice (frontend / mobile dev, CAD, 3D modeling, etc.)
Traditionally, a lot of these use cases have higher latency with LLM agents because the model needs to wait for the speaker to finish, then decide whether to call a tool or respond - if it calls a tool, it needs to process the tool result and again decide whether to call another tool or respond, etc...
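To make the latency point concrete, here is a rough sketch of that wait-then-act loop as a tiny simulation. Everything in it is an assumption for illustration: the function name `run_turn`, the constants `MODEL_DECISION_S` and `TOOL_CALL_S`, and the timings themselves are made up, not measurements of any real model or API.

```python
from typing import List, Tuple

# Hypothetical timings, chosen only to illustrate how the delays stack up.
MODEL_DECISION_S = 0.5  # assumed time for the model to pick "tool" or "respond"
TOOL_CALL_S = 1.0       # assumed time for one tool round trip


def run_turn(utterance_s: float, tool_calls_needed: int) -> Tuple[float, List[str]]:
    """Simulate one turn of a turn-based agent.

    Returns the seconds elapsed from the start of the user's speech until
    the reply begins, plus a trace of the steps taken.
    """
    elapsed = 0.0
    trace: List[str] = []

    # 1. The model waits for the speaker to finish before doing anything.
    elapsed += utterance_s
    trace.append("listen")

    remaining = tool_calls_needed
    while True:
        # 2. The model decides whether to call a tool or respond.
        elapsed += MODEL_DECISION_S
        if remaining == 0:
            trace.append("respond")
            return elapsed, trace
        # 3. It calls a tool, then loops back to process the result.
        elapsed += TOOL_CALL_S
        trace.append("tool")
        remaining -= 1


if __name__ == "__main__":
    total, steps = run_turn(utterance_s=4.0, tool_calls_needed=2)
    print(steps, "reply starts", total - 4.0, "s after the user stops talking")
```

With these made-up numbers, a 4-second utterance that needs two tool calls gets its reply 3.5 seconds after the user stops speaking, because every decision and tool call is paid for only after the speaker has finished.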