HN Mirror

by x3haloed 766 days ago

Exactly. I'm not sure if this is brand new or not, but this is definitely on the frontier.

I was literally just thinking about this a few days ago... that we need a multi-modal language model with speech training built-in.

As soon as this thing rolls out, we'll be talking to language models like we talk to each other. Previously it was like dictating a letter and waiting for the responding letter to be read to you. Communication is possible, but not really in the way that we do it with humans.

This is MUCH more human-like, with the ability to interrupt each other and glean context clues from the full richness of the audio.

The model's ability to sing is really fascinating, and so is its ability to change the sound of its voice: its pacing, its pitch, its tonality. I don't know how they're controlling all that via GPT-4o tokens, but this is much more interesting than what we had before.

I honestly don't fully understand the implications here.