| HN Mirror



	by spuz 665 days ago
	Is there anyone besides OpenAI working on a speech-to-speech model? I find it incredibly useful, and it's the sole reason I pay for their service, but I do find it very limited. I'd be interested to know if any other groups are doing research on voice models.

3 comments

Ey7NFZ3P0nzAe 665 days ago

Yes. Kyutai released an open model called Moshi: https://github.com/kyutai-labs/moshi

There's also llama-omni and a few others. None of them are even close to 4o from an LLM standpoint. But Moshi is called a "foundational" model and I'm hopeful it will be enhanced. Also, most backends like llama.cpp and Ollama don't support these models yet. So I'd say we're in a trough, but we'll get there.


russ 665 days ago

There’s Ultravox as well (from one of the creators of WebRTC): https://github.com/fixie-ai/ultravox

Their model builds a speech-to-speech layer into Llama. Last I checked they have the audio-in part working and they’re working on the audio-out piece.


0x1ceb00da 665 days ago

When I asked Advanced Voice Mode, it said that it receives input as audio and generates text as output.


mbrock 665 days ago

It is mistaken, because it has no particular insight into its own implementation. In fact, the whole point is that it directly consumes and produces audio tokens, with no text in between. That's why it's able to sing, make noises, do accents, and so on.
