|
|
|
|
|
by avarun
767 days ago
|
|
Yeah, same question here. Building pipelines for bridging LLMs and TTS and STT models with lower latency is fine and all, but when you compare to a natively multimodal model like GPT-4o it seems strictly inferior. The future is clearly voice-native models that are able to understand nuances in voice and speech patterns, and it's not exactly a distant future. |
|