by kelseydh 462 days ago
A lot of people forget that the Photos app on the iPhone only recently gained OCR text search for the pictures on your phone. Google had that feature on their phones many years before Apple.

Apple's TTS voices still run on 10-year-old technology. Pretty disappointing; at one time they had the best system voices.
The ML blog seems to disagree with that take: https://machinelearning.apple.com/research/on-device-neural-... (2021)

> Recent advances in text-to-speech (TTS) synthesis, such as Tacotron and WaveRNN, have made it possible to construct a fully neural network based TTS system, by coupling the two components together. [...] However, the high computational cost of the system and issues with robustness have limited their usage in real-world speech synthesis applications and products. In this paper, we present key modeling improvements and optimization strategies that enable deploying these models, not only on GPU servers, but also on mobile devices

Having worked on Apple's TTS for more than a decade, I can state with confidence that this is utter bullshit and you don't have the slightest idea what you are talking about. Both in terms of quality, and of the underlying technology used, Apple's current TTS is in no way comparable to what existed 10 years ago (at Apple, or anywhere else in the industry).

I challenge you to find a 2014 recording that is on par with a contemporary Siri voice.

I have been playing with those enhanced TTS models recently, and to me they are of similar quality to the Piper TTS models: not that good. StyleTTS 2 models like Kokoro sound so much better to me and also run in real time on their devices. And when you compare their online models, not even to what OpenAI has but to some small recent startups like Sesame or open source models like Orpheus, Apple TTS sounds (pun intended) really behind.
I don't dispute your claim, just that I still find the Alex voice to be the best, and it has been the same for over 10 years. The other voices have issues; they don't sound too good at 1.5x.
Ah, that's more specific.

Alex was developed when VoiceOver (the screen reader) was the primary use case for text to speech. Consequently, it was optimized for low latency and robustness under rate changes.

The Siri voices sound much more natural at 1x and have a higher signal quality, but rate changes were a lower priority for this use case.

Fun fact: when we worked on Alex, many VoiceOver users stubbornly hung on to Fred (which is mostly using late 1970s technology). Screen reader users are not fond of switching voices; it appears their hearing locks in to a particular voice, so switching is costly.