by espadrine 3571 days ago
I wouldn't. The results they offer are excellent, but the missing points they need to achieve human level are related to producing the correct intonation, which requires accurate understanding of the material. That is still at least ten years in the future, I expect.

Not really. They're training directly on the waveform, so the model can learn intonation. They just need to train on longer samples, and perhaps augment their linguistic representation with some extra discourse analysis.

A big problem with generating prosody has always been that our theories of it don't really provide a great prediction of people's behaviours. It's also very expensive to get people to do the prosody annotations accurately, using whatever given theory.

Predicting the raw audio directly cuts out this problem. The "theory" of prosody can be left latent, rather than specified explicitly.

There's zero chance of effective intonation and tone without understanding of the material.
I think your use of the term "understanding" is very unhelpful here. It's better to think about what you need to condition on to predict correctly.

In fact, most intonation decisions are pretty local, within a sentence or two. The most important thing is the given/new contrast, i.e. the information structure. This is largely determined by the syntax, which we're doing pretty well at predicting, and which latent representations in a neural network can be expected to capture adequately.

The same sentence can have a very nonlocal difference in intonation.

Say, “They went in the shed”. You won't pronounce it in a neutral voice if it was explained in the previous chapter that a serial killer is in it.

On the other hand, if the shed contains a shovel that is quickly needed to dig out a treasure, which is the subject of the novel since page 1, you will imply urgency.

With enough labor, you could annotate enough sentences to cover a lot of dialogue cases. Sections like "'stop!', he said angrily/dryly/mockingly" are probably fairly common. You'd be modeling the next most probable inflection given previous words and selected tones.
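
To make the "most probable inflection given selected tones" idea concrete, here's a rough sketch. It is a plain frequency count, not anything WaveNet does, and the tone adverbs, inflection labels, and annotated examples are all made up for illustration; a real system would condition on the preceding words too.

```python
from collections import Counter, defaultdict

# Hypothetical hand-annotated data: (tone adverb, inflection label).
# Both the labels and the examples are invented for this sketch.
ANNOTATED = [
    ("angrily", "loud-falling"),
    ("angrily", "loud-falling"),
    ("angrily", "clipped"),
    ("dryly", "flat"),
    ("mockingly", "rising-exaggerated"),
]


def train(examples):
    """Count how often each inflection was annotated for each tone."""
    counts = defaultdict(Counter)
    for tone, inflection in examples:
        counts[tone][inflection] += 1
    return counts


def most_probable_inflection(counts, tone, default="neutral"):
    """Return the most frequently annotated inflection for a tone.

    Falls back to `default` for tones that never appeared in the data.
    """
    if tone not in counts:
        return default
    return counts[tone].most_common(1)[0][0]


counts = train(ANNOTATED)
print(most_probable_inflection(counts, "angrily"))    # loud-falling
print(most_probable_inflection(counts, "wistfully"))  # neutral (unseen tone)
```

The unseen-tone fallback is where the "novel arrangements" problem shows up: anything the annotators never covered just gets the default.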

What would require understanding would be novel arrangements and metaphor to indicate emotional state. On-the-fly variations to avoid monotony might also be difficult, as well as sarcasm or combinations/levels (e.g. she spoke matter-of-factly but with mirth lightly woven through).

And who says it can't understand the material? There have been recurrent networks trained that can translate between languages, or predict the next word in a sentence, at remarkable accuracy. Combined with wavenet this could be quite effective.
There could be cases where the intonation depends on things entirely outside the book: say, a politician in the text does something that is far from what we would expect them to do in today's world.
How about we allow annotation of text with prosody cues? Mark the words you want stressed. We already use question and exclamation marks.
I'd love that. Writing is a poor representation of language. It'd be nice to bring it up a notch. Here's a suggestion in a paper I wrote on better second language acquisition. https://www.researchgate.net/publication/261022308_BETTER_SE...
Like traditional audio books can capture perfectly what you're referring to...
They can, though?
I don't see why many aspects of intonation couldn't be taught the same way ...
I think the point is that different parts of the story need different intonation patterns (reading a scary part vs a boring part, etc.).

So in theory, it could be achieved by having multiple training sets (for the different intonation styles), along with analysis of the text to direct which part of the text needs what intonation. You might even be able to blend intonations.
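
A minimal sketch of the "analyse the text, then pick or blend a style" step, assuming a crude keyword match stands in for the text analysis. The style names and cue words are hypothetical; the blend weights would be fed to whatever per-style models you trained.

```python
from collections import Counter

# Hypothetical cue words per reading style; a real system would learn
# these from data rather than hard-code them.
STYLE_CUES = {
    "scary": {"killer", "dark", "scream", "blood"},
    "urgent": {"quickly", "hurry", "run", "now"},
}


def style_weights(passage):
    """Return blend weights over styles, summing to 1.

    Passages with no cue words fall back to a single 'neutral' style.
    """
    words = [w.strip(".,!?\"'").lower() for w in passage.split()]
    hits = Counter()
    for style, cues in STYLE_CUES.items():
        hits[style] = sum(1 for w in words if w in cues)
    total = sum(hits.values())
    if total == 0:
        return {"neutral": 1.0}
    # Keep only styles that actually matched, normalised by total hits.
    return {style: n / total for style, n in hits.items() if n > 0}


print(style_weights("They went in the shed."))
print(style_weights("Run quickly, the killer is here!"))
```

Note that the first call comes back as pure neutral, which is exactly the shed problem from upthread: the cue that matters can be chapters away, so the analysis would need far more context than one sentence.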

Or just pay MTurk workers to annotate texts with intonation cues.

I kinda doubt that would be profitable relative to just hiring readers, but in general you don't need to replace workers completely to cannibalize some of their wages/jobs.

Or treat it as part of the original author's job. When you write a piece of music you add tempo and intensity metadata to the score, so why not do the same when writing a novel?
Or the author could just add that information to the text. This way there's no need to "understand" it.
There have been significant advances in sentiment analysis too. Trading bots use sentiment analysis as one of the inputs to their time series prediction algorithms. I would not say 10 years.
What about auto-tuning? I can do a pretty good reading-with-intention but I don't have the melt-your-brain-rich tones of Stephen Fry or Ian McKellen.