Hacker News
by SlipperySlope 4150 days ago
Compare the Watson text-to-speech voices with Nuance ...

Watson http://text-to-speech-demo.mybluemix.net/

Nuance http://www.nuance.com/for-business/text-to-speech/vocalizer/...

I prefer the Watson version voicing a sample paragraph. Both are good enough for an application that selects on price. For a voice-first application, maybe Watson is better for TTS.

For speech to text, Nuance has been the leader, e.g. Apple's Siri. Has anyone compared IBM speech recognition to Nuance, Microsoft & Google?

5 comments

We know we have strong core speech technology, based on comparisons from competitive evaluations run in conjunction with various government-funded speech programs. However, our service is still very new. We could have waited for months to tune it, but our primary goal here is to solicit feedback from the community on how to make our services easier to use, especially in the context of our other platform services. We don't want to wait until the design is so mature that it is impossible to change - so any and all feedback is very welcome!
I run a human-powered audio transcription service and I'd be very interested in trying it out. I went through the API docs and it seems straightforward enough. However, what's the pricing? I can't find it anywhere. Is it free?
I believe all of the Watson services are free while in beta and will be paid services once they mature a bit more and exit the beta.
Yes, but how do you sign up for either service?
The Watson services are only accessible through Bluemix for the moment. Create an account on https://console.ng.bluemix.net and then add a service instance.

The idea was that you'll also host your application in Bluemix, although I think the services are actually accessible elsewhere once you create the instance in Bluemix.

You mean IBM's service?
For TTS, compare further with Vocalware and CereProc

Vocalware https://www.vocalware.com/index/demo CereProc https://www.cereproc.com/

It is getting increasingly difficult to pick one as the clear leader for "natural sounding". The results are good enough for voicing canned text, and certainly better enunciated than many thick-accented English speakers. Improvements through training can still be made in parsing the text.

For example, IBM Watson interprets "IT" as "it", in the following sentence.

Thank you for calling the IT department.

Vocalware and CereProc correctly parse that.

The people I would really like to hear opinions from are professional voice actors, though they would understandably be leery of lending a hand to improve TTS. Is there a standardized form of writing text that communicates the kind of emphasis, placement of silence and warping of phonemes these actors use in their delivery to concisely convey emotion, that TTS products can adopt?

SSML is a speech synthesis markup language that has some degree of popularity in the field. The specific section on markup for emphasis is http://www.w3.org/TR/speech-synthesis11/#S3.2
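
To make that concrete, here is a minimal sketch of what SSML 1.1 markup for the "IT department" sentence above could look like, built with Python's standard library. The `emphasis`, `break` and `sub` elements are from the W3C spec; the function name `build_ssml`, the extra "Please hold the line" wording and the 300ms pause are made up for illustration, and nothing in this thread confirms which of these elements Watson, Nuance, Vocalware or CereProc actually honor.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the W3C SSML 1.1 recommendation.
SSML_NS = "http://www.w3.org/2001/10/synthesis"


def build_ssml() -> str:
    """Return an SSML document as a string (illustrative helper, not a vendor API)."""
    speak = ET.Element(
        "speak",
        {"version": "1.1", "xmlns": SSML_NS, "xml:lang": "en-US"},
    )
    speak.text = "Thank you for calling the "

    # <sub> asks the engine to speak the alias instead of the written text,
    # so "IT" is read letter by letter rather than as the word "it".
    sub = ET.SubElement(speak, "sub", {"alias": "I T"})
    sub.text = "IT"
    sub.tail = " department. "

    # <break> inserts a pause; the duration here is an arbitrary example.
    pause = ET.SubElement(speak, "break", {"time": "300ms"})
    pause.tail = "Please "

    # <emphasis> marks a word to be stressed; level may be
    # "strong", "moderate", "none" or "reduced".
    emphasis = ET.SubElement(speak, "emphasis", {"level": "strong"})
    emphasis.text = "hold"
    emphasis.tail = " the line."

    return ET.tostring(speak, encoding="unicode")


if __name__ == "__main__":
    print(build_ssml())
```

The resulting string is what you would send in place of plain text to an engine that accepts SSML. It covers emphasis, silence and substitution, but it is much coarser than what a voice actor does with phonemes and emotion.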
I believe that the Nuance technology is built on IBM Speech research: http://www.nuance.com/for-business/by-solution/customer-serv...
My evidence is anecdotal at best, but I have found Siri to be terrible and my "OK, Google" to be wonderful.
As a speech technologist, I am amazed and proud about how far along the technology has progressed, especially over the last few years. Even my wife now uses speech input on mobile devices (and may finally think I may be doing something productive...). With that said, speech input is still a surprisingly finicky technology and different people will see different behaviors across systems from different providers.
I can only imagine how finicky it is. But it is truly amazing tech, and quite revolutionary. I probably do about 75%+ of my searches via voice, and it would more likely be 90%+ if I wasn't embarrassed about talking to my phone in public and broadcasting my searches to anyone in earshot :P
Siri was completely unusable/unresponsive from 2011/2012, but then, somewhere around 2012/2013, started to become pretty good (most of the time) for things like, "Wake me up at 6:30 AM" - I used it for that type of query a lot. Dictation, though, was spotty - I would say about 10-20% of the time, I just got a spinning non-response, and even when it did work, it would be slow, and the results would be iffy. And, once again, I used the dictation a lot.

But sometime in 2014, and I can't really place it, but right around June/August, Siri all of a sudden turned a corner, and her dictation ability got markedly better - so much so that I don't even bother typing into my iPhone if I'm in a place where I can talk to it. Dictation is 99% flawless, much better than my typing, and unquestionably faster.

For whatever reason, Apple hasn't been making a big deal of this - perhaps because they don't want to admit how crappy it was before - but it really is a big deal. Siri is, 3 years later, what she should have been in 2011.

Can't wait to see what the next step in this evolution will be...

My understanding is that it is acoustic modeling that was drastically improved using deep learning. That is, while speech recognition improved, acoustic modeling improved more. So, strictly speaking, technology is now better at ignoring noise, rather than better at understanding speech. Of course, to users, there is no difference.
The Watson voice is great, but I think CereProc voices sound the most natural. Also, I like that you can use them offline.