Hacker News
by ianbicking 385 days ago
I've been using OpenAI's new models a lot lately (https://www.openai.fm/)... Separating instructions from the spoken word is an interesting choice, and I'm assuming it also has a lot to do with OpenAI/GPT using "instructions" across their products; maybe they are just more comfortable and familiar with generating the data and doing the training for that style.

Separate instructions are a bit awkward, but they do allow mixing general instructions with specific ones. For example, I can concatenate output-specific instructions like "voice lowers to a whisper after 'but actually', and a touch of fear" with a general instruction like "a deep voice with a hint of an English accent" and it mostly figures it out.
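A minimal sketch of that concatenation, in Python. The helper names `combine_instructions` and `build_speech_request` are hypothetical, and the payload fields (`model`, `voice`, `input`, `instructions`) plus the default model and voice values are assumptions modeled on OpenAI's speech endpoint, so check the current API reference before relying on them. Nothing here calls the API; it only builds the request body.

```python
def combine_instructions(general: str, specific: list[str]) -> str:
    """Join one general voice instruction with output-specific ones.

    Empty entries are dropped and trailing periods are stripped so the
    joined string does not end up with doubled punctuation.
    """
    parts = [p.strip().rstrip(".") for p in [general, *specific] if p.strip()]
    return ". ".join(parts) + "." if parts else ""


def build_speech_request(text: str, general: str, specific: list[str],
                         model: str = "gpt-4o-mini-tts",  # assumed model name
                         voice: str = "alloy") -> dict:    # assumed voice name
    """Build a request body; the field names are assumptions, not a verified schema."""
    return {
        "model": model,
        "voice": voice,
        "input": text,
        "instructions": combine_instructions(general, specific),
    }


request = build_speech_request(
    text="It looked safe, but actually, something was behind the door.",
    general="a deep voice with a hint of an English accent",
    specific=["voice lowers to a whisper after 'but actually', and a touch of fear"],
)
```

Keeping the general instruction separate means it can be reused across many outputs, with only the per-output list changing.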

The result with OpenAI feels much less predictable and of lower production quality than Eleven Labs. But the range of prosidy is much larger, almost overengaged. The range of _voices_ is much smaller with OpenAI... you can instruct the voices to sound different, but it feels a little like the same person doing different voices.

But in the end OpenAI's biggest feature is that it's 10x cheaper and completely pay-as-you-go. (Why are all these TTS services doing subscriptions on top of limits and credits? Blech!)

3 comments

That's the reason I don't use ElevenLabs and go with worse solutions: I don't want to feel like I'm paying for a whole chunk of compute every single month, whether I use it or not, with the only option being to pay for a yet larger chunk of compute if I run out.

Terrible pricing model, in my opinion.

> The result with OpenAI feels much less predictable and of lower production quality than ElevenLabs

Thank you Ian! Credit to our research team for making this possible

For the prosidy, if you choose an expressive voice the prosidy should be larger

The word is “prosody”, right?

Ninjaing in to ask: is v3 on the roadmap for your voice agents? The quality increase is huge.

Yep, low latency models are on the way.

> But in the end OpenAI's biggest feature is that it's 10x cheaper and completely pay-as-you-go. (Why are all these TTS services doing subscriptions on top of limits and credits? Blech!)

Is that so, once the LLM costs and other overheads have been considered? ElevenLabs conversational agents are priced at 0.08 per minute at the highest tier. What is the comparable price at OpenAI? I did a rough estimate and found it was higher there than at ElevenLabs, although my napkin calculations could also be wrong.

It's confusing, but looking closer, 10x is an exaggeration; it's more like 5x...

https://elevenlabs.io/pricing

Creator tier (lowest tier that's full service) is $22/mo for 250 minutes, $0.08/minute. Then it's $0.15/1000 characters. (So many different fucking units! And these prices are actually "credits" translated to other units; I fucking hate funny-money "credits")

https://platform.openai.com/docs/pricing#transcription-and-s...

Estimated $0.015/minute (actually priced based on tokens; yet more weird units!)

The non-instruction models are $0.015/1000 characters.

It starts getting more competitive when you are at the highest tier at ElevenLabs ($1320/month), but because of their pricing structure I'm not going to invest the time in finding out if it's worth it.
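For anyone wanting to redo the napkin math, here is a rough sketch in Python using only the figures quoted above. The constant and function names are made up for illustration, and the prices are a snapshot from this thread, not current list prices. Note that $22 divided by 250 minutes comes to about $0.088/minute, slightly above the quoted $0.08.

```python
# Figures as quoted in this thread; treat them as a snapshot, not current pricing.
ELEVENLABS_CREATOR_MONTHLY_USD = 22.0
ELEVENLABS_CREATOR_MINUTES = 250
ELEVENLABS_OVERAGE_PER_1K_CHARS = 0.15
OPENAI_EST_PER_MINUTE = 0.015          # estimate; actual pricing is token-based
OPENAI_NON_INSTRUCT_PER_1K_CHARS = 0.015


def effective_rate(monthly_usd: float, minutes_used: float) -> float:
    """Cost per minute actually used under a flat monthly subscription."""
    if minutes_used <= 0:
        raise ValueError("minutes_used must be positive")
    return monthly_usd / minutes_used


# Subscription fully used: the best case for the subscription.
full_use_rate = effective_rate(ELEVENLABS_CREATOR_MONTHLY_USD, ELEVENLABS_CREATOR_MINUTES)
per_minute_ratio = full_use_rate / OPENAI_EST_PER_MINUTE

# Per-character comparison for overage vs. the non-instruction models.
per_char_ratio = ELEVENLABS_OVERAGE_PER_1K_CHARS / OPENAI_NON_INSTRUCT_PER_1K_CHARS

# Light month: only 50 of the 250 included minutes used.
light_use_rate = effective_rate(ELEVENLABS_CREATOR_MONTHLY_USD, 50)

print(f"per-minute ratio at full use: {per_minute_ratio:.1f}x")
print(f"per-1000-character ratio:     {per_char_ratio:.1f}x")
print(f"effective $/min at 50 min:    {light_use_rate:.2f}")
```

With these numbers the per-minute gap is a bit under 6x when the subscription is fully used, the per-character gap is 10x, and the gap widens in months where the included minutes go unused.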

> It starts getting more competitive when you are at the highest tier at ElevenLabs ($1320/month), but because of their pricing structure I'm not going to invest the time in finding out if it's worth it.

They do have a grant programme though, which gives 3 months free of the largest tier.

https://elevenlabs.io/startup-grants