by cypher543 3488 days ago

One thing I wish more services like this offered is non-speech sounds. CereVoice, for example, lets you insert laughs, coughs, sighs, etc., and it can really enhance the output in some cases. Google's WaveNet also manages to simulate the catching of one's breath during particularly long utterances, although I realize it uses a completely different technique (a neural net rather than concatenative synthesis).

My biggest problem with CereVoice, though, has been its terrible web API. It doesn't support streaming output, so it renders the audio to an Amazon S3 bucket and then returns a URL, which is pretty inconvenient (and slow). You have to do the same for transcripts, too. So, if you want everything, you have to make three separate HTTP requests and parse two XML documents for one round of synthesis.

IBM Watson's TTS API gets it right, imo. Its streaming mode returns audio frames and transcripts over a WebSocket connection.
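
To illustrate why that design is convenient, here's a rough sketch of the client side: audio and transcripts arrive interleaved on one connection, so you just sort frames by type as they come in. This is not Watson's actual message schema. The assumption that audio arrives as binary frames and transcripts as JSON text frames, the `words` field, and the `demux_tts_stream` name are all made up for illustration, and the frames are simulated rather than read from a real socket.

```python
import json
from typing import Iterable, List, Tuple, Union

# Assumption: binary frames carry audio, text frames carry JSON transcript data.
Frame = Union[bytes, str]


def demux_tts_stream(frames: Iterable[Frame]) -> Tuple[bytes, List[dict]]:
    """Split one mixed frame stream into audio bytes and transcript events."""
    audio = bytearray()
    events: List[dict] = []
    for frame in frames:
        if isinstance(frame, (bytes, bytearray)):
            # Audio chunk: append in arrival order (could be played immediately).
            audio.extend(frame)
        else:
            # Text frame: parse as JSON (hypothetical transcript/timing payload).
            events.append(json.loads(frame))
    return bytes(audio), events


# Simulated frames standing in for what a WebSocket client would receive.
frames = [
    b"\x00\x01",
    '{"words": [["hello", 0.0, 0.4]]}',
    b"\x02\x03",
]
audio, events = demux_tts_stream(frames)
```

Everything comes back from a single connection, with no follow-up requests to fetch the audio or the transcript separately.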