|
|
|
|
|
by modeless
110 days ago
|
|
IMO STT -> LLM -> TTS is a dead end. The future is end-to-end. I played with this two years ago and even made a demo you can install locally on a gaming GPU: https://github.com/jdarpinian/chirpy, but concluded that making something worth using for real tasks would require training of end-to-end models. A really interesting problem I would love to tackle, but out of my budget for a side project. |
|
The cascading model (STT -> LLM -> TTS), is unlikely to go away anytime soon for a whole lot of reasons. A big one is observability. The people paying for voice agents are enterprises. Enterprises care about reliability and liability. The cascading model approach is much more amenable to specialization (rather than raw flexibility / generality) and auditability.
Organizations in regulated industries (e.g. healthcare, finance, education) need to be able to see what a voice agent "heard" before it tries to "act" on transcribed text, and same goes for seeing what LLM output text is going to be "said" before it's actually synthesized and played back.
Speech-to-Speech (end-to-end) models definitely have a place for more "narrative" use cases (think interviewing, conducting surveys / polls, etc.).
But from my experience from working with clients, they are clamoring for systems and orchestration that actually use some good ol' fashioned engineering and that don't solely rely on the latest-and-greatest SoTA ML models.