| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by modeless 110 days ago
	IMO STT -> LLM -> TTS is a dead end. The future is end-to-end. I played with this two years ago and even made a demo you can install locally on a gaming GPU: https://github.com/jdarpinian/chirpy, but concluded that making something worth using for real tasks would require training of end-to-end models. A really interesting problem I would love to tackle, but out of my budget for a side project.

7 comments

cootsnuck 109 days ago

I've been working solely on voice agents for the past couple years (and have worked at one of the frontier voice AI companies).

The cascading model (STT -> LLM -> TTS), is unlikely to go away anytime soon for a whole lot of reasons. A big one is observability. The people paying for voice agents are enterprises. Enterprises care about reliability and liability. The cascading model approach is much more amenable to specialization (rather than raw flexibility / generality) and auditability.

Organizations in regulated industries (e.g. healthcare, finance, education) need to be able to see what a voice agent "heard" before it tries to "act" on transcribed text, and same goes for seeing what LLM output text is going to be "said" before it's actually synthesized and played back.

Speech-to-Speech (end-to-end) models definitely have a place for more "narrative" use cases (think interviewing, conducting surveys / polls, etc.).

But from my experience from working with clients, they are clamoring for systems and orchestration that actually use some good ol' fashioned engineering and that don't solely rely on the latest-and-greatest SoTA ML models.

link

nicktikhonov 110 days ago

If you're of that opinion, you'll enjoy the new stuff coming out from nvidia:

https://research.nvidia.com/labs/adlr/personaplex/

link

woodson 110 days ago

You mean Moshi (https://github.com/kyutai-labs/moshi)? Since Personaplex is just a finetuned Moshi model.

link

mountainriver 110 days ago

Yeah except moshi doesn’t sound good at all

link

ilaksh 109 days ago

It just about works for our current use case but can't comprehend the concept of an outgoing call. So I am trying to fine tune it. Tricky thing is personaplex forked some of the kyutai code and has not integrated the LoRA stuff they added. So we tried to do update personaplex with the fine tuning stuff. Going to find out tonight or tomorrow whether it's actually feasible when I finish debugg/testing.

link

rockwotj 110 days ago

Fundamentally, the "guessing when its your turn thing" needs to be baked into the model. I think the full duplex mode that Moshi pioneered is probably where the puck is going to end up: https://arxiv.org/abs/2410.00037

link

com2kid 110 days ago

The advantage is being able to plug in new models to each piece of the pipeline.

Is it super sexy? No. But each individual type of model is developing at a different rate (TTS moves really fast, low latency STT/ASR moved slower, LLMs move at a pretty good pace).

link

eru 109 days ago

You should probably split it up: an end-to-end model for great latency (especially for baked in turn taking), but under the hood it can call out to any old text based model to answer more intricate question. You just need to teach the speech model to stall for a bit, while the LLM is busy.

Just use the same tricks humans are using for that.

link

donpark 110 days ago

But I've read somewhere that KV cache for speech-to-speech model explodes in size with each turn which could make on-device full-duplex S2S unusable except for quick chats.

link

tmzt 110 days ago

Gemini Nano is supposedly doing it on device. It looks like something similar should work with Apple GPU and ANE.

link

coppsilgold 109 days ago

Some of the best current voice tokenizers achieve ~12 Hz, that's many more tokens than a regular LLM would use for ultimately the same content.

link

russdill 109 days ago

At least running things locally, such a model completely blows up your latency

link