
	by jeffharris 463 days ago
	Hey, I'm Jeff and I was PM for these models at OpenAI. Today we launched three new state-of-the-art audio models. Two speech-to-text models—outperforming Whisper. A new TTS model—you can instruct it how to speak (try it on openai.fm!). And our Agents SDK now supports audio, making it easy to turn text agents into voice agents. We think you'll really like these models. Let me know if you have any questions here!

34 comments

claiir 463 days ago

Hi Jeff. This is awesome. Any plans to add word timestamps to the new speech-to-text models, though?

> Other parameters, such as timestamp_granularities, require verbose_json output and are therefore only available when using whisper-1.

Word timestamps are insanely useful for large calls with interruptions (e.g. multi-party debate/Twitter spaces), allowing transcript lines to be further split post-transcription on semantic boundaries rather than crude VAD-detected silence. Without timestamps it’s near-impossible to make intelligible two paragraphs from Speaker 1 and Speaker 2 with both interrupting each other without aggressively partitioning source audio pre-transcription—which severely degrades transcript quality, increases hallucination frequency and still doesn’t get the same quality as word timestamps. :)
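The post-transcription split described above can be sketched roughly as follows. This is only an illustration under stated assumptions: it presumes you already have word-level timestamps (for example from whisper-1 with timestamp_granularities) and speaker turns from some separate diarization step. `Word`, `Turn`, and `split_by_speaker` are hypothetical names, not part of any API mentioned in this thread.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float    # seconds


@dataclass
class Turn:
    speaker: str
    start: float  # seconds
    end: float    # seconds


def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def split_by_speaker(words, turns):
    """Assign each word to the turn it overlaps most, then group consecutive
    words from the same speaker into one transcript line."""
    lines = []
    for w in words:
        best = max(
            turns,
            key=lambda t: overlap(w.start, w.end, t.start, t.end),
            default=None,
        )
        if best is None or overlap(w.start, w.end, best.start, best.end) == 0.0:
            speaker = "unknown"  # no diarization turn covers this word
        else:
            speaker = best.speaker
        if lines and lines[-1][0] == speaker:
            lines[-1][1].append(w.text)
        else:
            lines.append((speaker, [w.text]))
    return [(speaker, " ".join(texts)) for speaker, texts in lines]
```

Because the split happens per word, an interruption in the middle of a sentence ends up on its own line without cutting the source audio before transcription.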

adeptima 462 days ago

Accurate word timestamps seem to be extra overhead and require post-processing such as forced alignment (a speech technique that automatically aligns audio files with transcripts).

I recently dug into forced alignment and discovered that most new models don't operate on word or phoneme boundaries, but rather chunk audio with overlap and do word and context matching. Older HMM-style models have shorter strides (10ms vs 20ms).

I tried searching the Kaldi/Sherpa ecosystem, and found that most info leads nowhere or to very small and inaccurate models.

Appreciate any tips on the subject

keepamovin 462 days ago

You need speaker attribution, right?

noosphr 463 days ago

Having read the docs - used chat gpt to summarize them - there is no mention of speaker diarization for these models.

This is a _very_ low hanging fruit anyone with a couple of dgx h100 servers can solve in a month and is a real world problem that needs solving.

Right now _no_ tools on the market - paid or otherwise - can solve this with better than 60% accuracy. One killer feature for decision makers is the ability to chat with meetings to figure out who promised what, when and why. Without speaker diarization this only reliably works for remote meetings where you assume each audio stream is a separate person.

In short: please give us a diarization model. It's not that hard - I've done one for a board of 5, with a 4090 over a weekend.

markush_ 462 days ago

> This is a _very_ low hanging fruit anyone with a couple of dgx h100 servers can solve in a month and is a real world problem that needs solving.

I am not convinced it is a low hanging fruit, it's something that is super easy for humans but not trivial for machines, but you are right that it is being neglected by many. I work for speechmatics.com and we spent a significant amount of effort over the years on it. We now believe we have the world's best real-time speaker diarization system, you should give it a try.

noosphr 462 days ago

After throwing the average meeting as an mp3 to your system, yes, you have diarization solved much better than everyone else I've tried by far. I'd say you're 95% of the way to being good enough for becoming the backbone of monolingual corporate meeting transcription, and I'll be buying API tokens the next time I need to do this instead of training a custom model. Your transcription however isn't that great - but good enough for LLMs to figure out the minutes of the meeting.

That said, the trick to extracting voices is to work in frequency space. Not sure what your model does but my home made version first ran all the audio through an FFT, then essentially became a vision problem for finding speech patterns that matched in pitch and finally output extremely fine-grained time stamps for where they were found and some python glue threw that into an open source Whisper speech-to-text model.
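As a rough idea of what "working in frequency space" means at the first step, here is a minimal sketch that finds the strongest frequency in each frame of audio. It is not the commenter's model: it uses a naive O(n²) DFT in pure Python so it runs without dependencies (a real pipeline would use an FFT such as `numpy.fft.rfft`), and the strongest bin is only a crude stand-in for pitch. `dominant_frequency` and `pitch_track` are hypothetical names.

```python
import cmath
import math


def dominant_frequency(frame, sample_rate):
    """Return the frequency (Hz) of the strongest non-DC bin of a naive DFT."""
    n = len(frame)
    best_bin, best_mag = 0, 0.0
    for k in range(1, n // 2):  # skip the DC bin, stop below Nyquist
        acc = 0j
        for i, x in enumerate(frame):
            acc += x * cmath.exp(-2j * math.pi * k * i / n)
        mag = abs(acc)
        if mag > best_mag:
            best_bin, best_mag = k, mag
    return best_bin * sample_rate / n


def pitch_track(samples, sample_rate, frame_len=400):
    """Dominant frequency for each non-overlapping frame of the signal."""
    return [
        dominant_frequency(samples[i:i + frame_len], sample_rate)
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]
```

Frames whose dominant frequency stays in a similar range could then be grouped as candidate stretches from one voice, which is where the pattern-matching step described above would take over.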

vessenes 462 days ago

Hi Jeff, thanks for these and congrats on the launch. Your docs mention supporting accents. I cannot get accents to work at all with the demo.

For instance erasing the entire instruction and replacing it with ‘speak with a strong Boston accent using eg sounds like hahhvahhd’ has no audible effect on the output.

As I’m sure you know 4o at launch was quite capable in this regard, and able to speak in a number of dialects and idiolects, although every month or two seems to bring more nerfs sadly.

A) can you guys explain how to get a US regional accent out of the instructions? Or what you meant by accent if not that?

B) since you’re here I’d like to make a pitch that setting 4o for refusal to speak with an AAVE accent probably felt like a good idea to well intentioned white people working in safety. (We are stopping racism! AAVE isn’t funny!) However, the upshot is that my black kid can’t talk to an ai that sounds like him. Well, it can talk like he does if he’s code switching to hang out with your safety folks, but it considers how he talks with his peers as too dangerous to replicate.

This is a pernicious second order race and culture impact that I think is not where the company should be.

I expect this won’t get changed - chat is quite adamant that talking like millions of Americans do would be ‘harmful’ - but it’s one of those moments where I feel the worst parts of the culture wars coming back around to create the harm it purports to care about.

Anyway the 4o voice to voice team clearly allows the non mini model to talk like a Bostonian which makes me feel happy and represented; can the mini api version do this?

simonw 463 days ago

Is there any chance that gpt-4o-transcribe might get confused and accidentally follow instructions in the audio stream instead of transcribing them?

simonw 463 days ago

Here's a partial answer to my own question: https://news.ycombinator.com/item?id=43427525

> e.g. the audio-preview model when given instruction to speak "What is the capital of Italy" would often speak "Rome". This model should be much better in that regard

"Much better" doesn't sound like it can't happen at all though.

dandiep 463 days ago

1) Previous TTS models had major problems with accents. E.g. a Spanish sentence could drift from a Spain accent to Mexican to American all within one sentence. Has this been improved and/or is it still a WIP?

2) What is the latency?

3) Your STT API/Whisper had MAJOR problems with hallucinating things the user didn't say. Is this fixed?

4) Whisper and your audio models often auto corrected speech, e.g. if someone made a grammatical error. Or if someone is speaking Spanish and inserted an English word, it would change the word to the Spanish equivalent. Does this still happen?

jeffharris 463 days ago

1/ we've been working a lot on accents, so expect improvements with these models... though we're not done. Would be curious how you find them. And try giving specific detailed instructions + examples for the accents you want

2/ We're doing everything we can to make it fast. Very critical that it can stream audio meaningfully faster than realtime

3+4/ I wouldn't call hallucinations "solved", but it's been the central focus for these models. So I hope you find it much improved

wewewedxfgdf 463 days ago

As mentioned in another comment, the British accents are very far from being authentic.

jbaudanza 463 days ago

3) Whisper really needs to be paired with Silero VAD, otherwise the hallucination problem makes it almost unusable.
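To illustrate where a VAD sits in the pipeline, here is a minimal energy-threshold sketch that finds the non-silent spans you would pass to the transcriber. This is explicitly not Silero VAD (which is a trained neural model); it is a crude stand-in, and `speech_segments`, the threshold, and the frame size are all assumptions chosen for illustration.

```python
import math


def speech_segments(samples, sample_rate, frame_ms=30, threshold=0.02, min_frames=3):
    """Return (start_s, end_s) spans whose per-frame RMS energy stays at or
    above `threshold` for at least `min_frames` consecutive frames."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    segments = []
    start = None  # index of the first frame of the current loud run

    for f in range(n_frames):
        frame = samples[f * frame_len:(f + 1) * frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / frame_len)
        if rms >= threshold:
            if start is None:
                start = f
        elif start is not None:
            if f - start >= min_frames:
                segments.append((start * frame_len / sample_rate,
                                 f * frame_len / sample_rate))
            start = None

    # Close a run that is still open at the end of the audio.
    if start is not None and n_frames - start >= min_frames:
        segments.append((start * frame_len / sample_rate,
                         n_frames * frame_len / sample_rate))
    return segments
```

Only the returned spans would be sent for transcription, so the model never sees long silences, which is where hallucinated text tends to appear according to the comments here.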

dandiep 463 days ago

100% and I’ve done this, but it’s still there.

kiney 463 days ago

Are the new models released with weights under an open license like whisper? If not, is it planned for the future?

a-r-t 463 days ago

Hi Jeff, are there any plans to support dual-channel audio recordings (e.g., Twilio phone call audio) for speech-to-text models? Currently, we have to either process each channel separately and lose conversational context, or merge channels and lose speaker identification.

jeffharris 463 days ago

this has been coming up often recently. nothing to announce yet, but when enough developers ask for it, we'll build it into the model's training

diarization is also a feature we plan to add

a-r-t 462 days ago

Glad to hear it's on your radar. I'd imagine phone call transcription is a significant use case.

ekzy 463 days ago

I’m not entirely sure what you mean but Twilio recordings support dual channels already

a-r-t 463 days ago

Transcribing Twilio's dual-channel recordings using OpenAI's speech-to-text while preserving channel identification.

ekzy 463 days ago

Oh I see what you mean that would be a neat feature. Assuming you can get timestamps though it should be trivial to work around the issue?

a-r-t 462 days ago

There are two options that I know of:

1. Merge both channels into one (this is what Whisper does with dual-channel recordings), then map transcription timestamps back to the original channels. This works only when speakers don't talk over each other, which is often not the case.

2. Transcribe each channel separately, then merge the transcripts. This preserves perfect channel identification but removes valuable conversational context (e.g., Speaker A asks a question, Speaker B answers incomprehensibly) that helps the model's accuracy.

So yes, there are two technically trivial solutions, but you either get somewhat inaccurate channel identification or degraded transcription quality. A better solution would be a model trained to accept an additional token indicating the channel ID, preserving it in the output while benefiting from the context of both channels.
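Option 2 above can be sketched as a simple timestamp merge. This is a minimal illustration, assuming each channel has already been transcribed on its own into `(start, end, text)` segments sorted by start time; where those timestamps come from (whisper-1 verbose_json, VAD-based chunking, or something else) is left open, and `merge_channels` is a hypothetical helper name.

```python
import heapq


def merge_channels(channels):
    """Interleave per-channel segments into one conversation ordered by time.

    channels: dict mapping a channel label to a list of (start, end, text)
              tuples, each list already sorted by start time.
    Returns a list of (start, label, text) tuples.
    """
    tagged = [
        [(start, end, label, text) for start, end, text in segments]
        for label, segments in channels.items()
    ]
    # heapq.merge lazily merges already-sorted inputs; ties on start time
    # fall back to end time, then label.
    return [(start, label, text) for start, _end, label, text in heapq.merge(*tagged)]
```

The merge itself is trivial, which matches the point made above: the cost is that each channel was transcribed without hearing the other side of the conversation.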

claiir 461 days ago

(2) is also significantly harder with these new models as they don’t support word timestamps like Whisper.

see > Other parameters, such as timestamp_granularities, require verbose_json output and are therefore only available when using whisper-1.

urbandw311er 463 days ago

Hey Jeff, this is awesome! I’m actually building a S2S application right now for a startup with the Realtime API and keen to know when these new voices/expressive prompting will be coming to it?

Also, any word on when there might be a way to move the prompting to the server side (of a full stack web app)? At the moment we have no way to protect our prompts from being inspected in the browser dev tools — even the initial instructions when the session is initiated on the server end up being spat back out to the browser client when the WebRTC connection is first made! It’s damaging to any viable business model.

Some sort of tri-party WebRTC session maybe?

kouteiheika 463 days ago

Any plans to open the weights of any of those?

jeffharris 463 days ago

nothing to share on open source yet, it's something we'll keep exploring. Especially as the models get smaller so more able to run on regular devices

new_user_final 463 days ago

Do you have plans to make it more realistic like kokoro-82M? I don't know if it's only me, but I find machine voices irritating to listen to for longer periods of time.

https://huggingface.co/hexgrad/Kokoro-82M

zhyder 463 days ago

How is the latency (Time To First Byte of audio, when streaming) and throughput (non-vibe characters input per second) compared to the existing 'tts-1' non-HD that's the same price? TTFB in particular is important and needs to be much better than 'tts-1'.
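For anyone wanting to compare models on these two numbers themselves, here is a minimal measurement sketch. It makes no assumptions about any particular SDK: `chunks` is any lazy iterable of audio byte chunks (for example a generator that issues the streaming request when first iterated), and `measure_stream` is a hypothetical helper name. Timing starts when iteration begins, so the request should not be sent before calling it.

```python
import time


def measure_stream(chunks, clock=time.perf_counter):
    """Consume a stream of byte chunks and report time to first byte,
    total bytes received, and total elapsed time (seconds)."""
    start = clock()
    ttfb = None
    total_bytes = 0
    for chunk in chunks:
        if ttfb is None and chunk:
            ttfb = clock() - start  # first non-empty chunk has arrived
        total_bytes += len(chunk)
    elapsed = clock() - start
    return {"ttfb_s": ttfb, "total_bytes": total_bytes, "elapsed_s": elapsed}
```

Dividing the number of input characters by `elapsed_s` gives a rough throughput figure; running the same text through two models several times gives a fairer comparison than a single call.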

nico 463 days ago

Are these models downloadable, like whisper?

What’s the minimum hardware for running them?

Would they run on a raspberry pi?

Or a smartphone?

jeffharris 463 days ago

not open source at this time. unfortunately they're much too large to run on normal consumer hardware

echoangle 463 days ago

Is that the reason you're not open sourcing them? Wouldn't it still make sense to provide it for enthusiasts?

famouswaffles 462 days ago

They're not open sourcing it because it's just gpt. Both of the new models are gpt-4o(-mini?) with presumably different fine-tuning. They're obviously not going to open source their flagship gpt models.

jwr 462 days ago

I guess you are aware of this, but just in case: some of us rely on dictation in our daily computer usage (think people with disabilities or pain problems). A MacBook Pro with M4 Max and 64GB of RAM could easily run something much larger than Whisper Large (around 3GB).

I would love a larger, better Whisper for use in the MacWhisper dictation app.

coconut08 463 days ago

with devices having unified memory now we are no longer limited to what can fit inside of a 3090 anymore. consumer hardware can have hundreds of gigabytes of memory now, is it really not able to fit in that?

staticautomatic 463 days ago

Any plans to directly support diarization or voiceprinting?

jeffharris 463 days ago

We're thinking about diarization (adding time awareness to GPT models) but no firm plans to share just yet

youssefabdelm 463 days ago

Jeff you know what would be magical? Not just vanilla diarization "Speaker 1" and "2" but if the model can know from the conversation this speaker was referred to as "Jeff Harris" or "Jeff" so it uses that instead.

youssefabdelm 463 days ago

Or if we could even provide samples of what an example speaker sounds like in general so that it would always classify them the way we want.

simonw 463 days ago

The feature I want is speaker differentiation - I want to feed in an audio file and get back a transcript with "Speaker 1: ..., Speaker 2: ..." indications.

That plus timestamps would be incredible.

The Google Gemini 2.0 models are showing some promise with this, I can't speak to their reliability just yet though.

runeb 463 days ago

I had good results with pyannote and the following model for that use case in the past https://huggingface.co/pyannote/speaker-diarization-3.1

infecto 463 days ago

I thought Deepgram already did speaker diarization (which is differentiation) pretty well. That and it can include timestamps plus other metadata.

thot_experiment 463 days ago

WhisperX does all of this, I use it all the time to transcribe meeting notes. Both speaker differentiation and individual word timestamps.

oidar 463 days ago

Any plans to offer speech to speech models which keep prosody, intonation, and timing intact? ElevenLabs is getting expensive for this.

jeffharris 463 days ago

we'll keep expanding these GPT-4o based models with more controls. Is the main feature we're missing custom voices?

oidar 462 days ago

No, not custom voices - but voices that can be influenced by a recording. As in, a male voice actor records a part, and the model transforms it to a female part - keeping all the prosody, intonation and timing in the original recording. This would allow one voice actor to do many roles.

robbomacrae 463 days ago

Hi Jeff, Thanks for updating the TTS endpoint! I was literally about to have to make a workaround with the chat completions endpoint with a hit and hope the transcription matches strategy... as it was the only way to get the updated voice models.

Curious.. is gpt-4o-mini-tts the equivalent of what is/was gpt-4o-mini-audio-preview for chat completions? Because in timing tests it takes around 2 seconds to return a short phrase which seems more equivalent to gpt-4o-audio-preview.. the latter was much better for the hit and hope strat as it didn't ad lib!

Also I notice you can add accents to instructions and it does a reasonable job. But are there any plans to bring out localized voice models?

jeffharris 463 days ago

It's a slightly better model for TTS. With extra training focusing on reading the script exactly as written.

e.g. the audio-preview model when given instruction to speak "What is the capital of Italy" would often speak "Rome". This model should be much better in that regard

No plans to have localized voice models, but we do want to expand the menu of voices with voices that are best at different accents

robbomacrae 463 days ago

Great to hear thanks. My favorite was "I would like you to repeat the following in an Australian accent: Hi there, welcome to Sydney." which was more often than not swapping "Hi there" for "G'day"!

twalkz 463 days ago

Woohoo new voices! I’ve been using a mix of TTS models on a project I’ve been working on, and I consistently prefer the output of OpenAI to ElevenLabs (at least when things are working properly).

Which leads me to my main gripe with the OpenAI models: I find they break (produce empty / incorrect / noise outputs) on a few key use cases for my application (things like single-word inputs, especially compound words and capitalized words, words in parentheses, etc.)

So I guess my question is might gpt-4o-mini-tts provide more “reliable” output than tts-1-hd?

TheAceOfHearts 463 days ago

Is it against the TOS to use it for sexually explicit content?

jeffharris 463 days ago

Yes, from our terms: "Don’t build tools that may be inappropriate for minors, including: Sexually explicit or suggestive content. This does not include content created for scientific or educational purposes." https://openai.com/policies/usage-policies/

knicholes 463 days ago

I don't have your answer, but as far as innuendo goes, it's definitely capable!

ekzy 463 days ago

Do you know when we can expect an update on the realtime API? It’s still in beta and there are many issues (e.g. voice randomly cutting off, VAD issues, especially with mulaw etc…) which make it impossible to use in production, but there’s not much communication from OpenAI. It’s difficult to know what to bet on. Pushing for stt->llm->tts makes you wonder if we should carry on building with the realtime API.

jeffharris 463 days ago

we're working hard on it at the moment and hope we'll have a snapshot ready in the next month or so

we've debugged the cutoff issues and have fixes for them internally but we need a snapshot that's better across the board, not just cutoffs (working on it!)

we're all in on S2S models both for API and ChatGPT, so there will be lots more coming to Realtime this year

For today: the new noise cancellation and semantic voice activity detector are available in Realtime. And ofc you can use gpt-4o-transcribe for user transcripts there

taf2 463 days ago

Agreed- really not liking how they are neglecting it… I hope they are just hard at work behind the scenes and will release something soon

jeffharris 463 days ago

S2S is where we're investing the most effort on audio ... sorry it's been slow but we are working hard on it

Top priorities at the moment 1) Better function calling performance 2) Improved perception accuracy (not mishearing) 3) More reliable instruction following 4) Bug fixes (cutoffs, run ons, modality steering)

dandiep 463 days ago

Appreciate the efforts. It’s not there yet, but when it gets there it will open up a lot of use cases.

Any fine tuning for s2s on the horizon?

dharmab 462 days ago

Hi Jeff, I have an app that already supports the Whisper API, so I added the GPT4o models as options. I noticed that the GPT4o models don't support prompting, and as a result my app had a higher error rate in practice when using GPT4o compared to Whisper. Is prompting on the roadmap?

progbits 463 days ago

> Two speech-to-text models—outperforming Whisper

On what metric? Also Whisper is no longer state of the art in accuracy, how does it compare to the others in this benchmark?

https://artificialanalysis.ai/speech-to-text

jeffharris 463 days ago

We've been using the FLEURS eval and you can see comparisons to other models on the market in the post https://openai.com/index/introducing-our-next-generation-aud...

Curious if there's a benchmark you trust most?

lern_too_spel 463 days ago

FLEURS and GP's Common Voice dataset focus on read speech. I've observed models that perform well on these datasets be completely useless on other distributions, like whispered speech or shouted speech or conversational speech between humans who aren't talking to a computer.

visarga 463 days ago

Hey Jeff, maybe you could improve the TTS that is currently in the OpenAI web and phone apps. When I set it to read numbers in Romanian it slurs digits. This also happens sometimes with regular words as well. I hope you find resources for other languages than English.

jeffharris 463 days ago

thanks for flagging ... number fidelity (especially on languages that are unfortunately less represented in training data) is still something we're working to improve

visarga 463 days ago

Actually even the new model does it. I had it read "12345 54321" and it read "2346 5321". So it both skips and hallucinates digits. This could be dangerous if it is used to read some news article or important text with numbers.

jbellis 461 days ago

How about more sample code for the streaming transcription api? I gave o1pro the docs for both the real-time endpoint and the stt API but we couldn't get it working (from Java, but any language would help).

taf2 463 days ago

Please release a stable realtime speech to speech model. The current version constantly thinks it’s a young teen heading to college and sad but then suddenly so excited about it

dietr1ch 463 days ago

can't wait for scam calls after this gets perfected

MasterScrat 462 days ago

I think it'd be worth clarifying on the openai.fm website that it's an official OpenAI product. I wasn't sure it was until I saw your comment here.

nabakin 463 days ago

Hey Jeff, thanks for your work! Quick question for you, are you guys using Azure Speech Services or have these TTS models been trained by OpenAI from scratch?

Etheryte 463 days ago

After toying around with the TTS model it seems incredibly nondeterministic. Running the same input with the same parameters can have widely different results, some really good, others downright bad. The tone, intonation and character all vary widely. While some of the outputs are great, this inconsistency makes it a really tough sell. Imagine if Siri responded to you with a different voice every time, as an example. Is this something you're looking to address somewhere down the line or do you consider that working as intended?

mazd 463 days ago

The Realtime API via WebRTC sample code for transcription is erroring. Could you take a look into this?

edwinarbus 463 days ago

Just to confirm, is it this sample under Connection details? https://platform.openai.com/docs/guides/realtime?text-genera...

mclau156 463 days ago

Does whispering work? I could not get it to work when I tried it

jeffharris 463 days ago

Should do! here's an example https://www.openai.fm/#4a5a82db-faea-4f80-813c-3131902c2458

mclau156 463 days ago

It seems to start out strong, but then starts loudly talking by the end, do you know why it loses focus?

edit: I actually got it to stay whispering by also putting (soft whispering voice) before the second paragraph

stavros 463 days ago

Hey Jeff! Are there any plans to offer the Sky voice again?

wewewedxfgdf 463 days ago

So there's no British accents?

jeffharris 463 days ago

try the ballad or fable voices

wewewedxfgdf 463 days ago

Doesn't really sound very British to be honest.

Sounds kinda international/like an American trying to do a British accent.

I've been looking for real TTS British accents so this product doesn't meet my goals.

GordonS 463 days ago

Azure TTS has some great British accents - I used a British female voice for a demo video voice over, and the quality was great. Not as good as ElevenLabs, but I was still really impressed with the final result.

modeless 463 days ago

Whisper's major problem was hallucinations, how are the new models doing there? The performance of ChatGPT advanced voice in recognizing speech is, frankly, terrible. Are these models better than what's used there?

nickthegreek 463 days ago

They say they are much better at not hallucinating but you also can't run it on your own hardware like whisper.

archerx 462 days ago

How did you make whisper better? I used whisper large to transcribe 30 podcast episodes and it did an amazing job. The times it made mistakes were understandable like confusing “Macs” and “Max”, slurred speech or people just saying things in a weird way. I was able to correct these mistakes because I understood the context of what was being talked about.

Another thing I noticed is whisper did a better job of transcribing when I removed a lot of the silences in the audio.

coconut08 463 days ago

would really love to see the new whisper style speech to text model open sourced.

pier25 463 days ago

what data did you use to train these models?