| HN Mirror

by syllogism 3578 days ago

Not really. They're training directly on the waveform, so the model can learn intonation. They just need to train on longer samples, and perhaps augment their linguistic representation with some extra discourse analysis.

A big problem with generating prosody has always been that our theories of it don't predict people's behaviour very well. It's also very expensive to get people to do the prosody annotations accurately under any given theory.

Predicting the raw audio directly cuts out this problem. The "theory" of prosody can be left latent, rather than specified explicitly.

ycombinatorMan 3578 days ago

There's zero chance of effective intonation and tone without an understanding of the material.

syllogism 3577 days ago

I think your use of the term "understanding" is very unhelpful here. It's better to think about what you need to condition on to predict correctly.

In fact, most intonation decisions are pretty local, within a sentence or two. The most important things are given/new contrasts, i.e. the information structure. This is largely determined by the syntax, which we're doing pretty well at predicting, and which latent representations in a neural network can be expected to capture adequately.

espadrine 3577 days ago

The same sentence can have a very nonlocal difference in intonation.

Say, “They went in the shed”. You won't pronounce it in a neutral voice if it was explained in the previous chapter that a serial killer is in it.

On the other hand, if the shed contains a shovel that is quickly needed to dig out a treasure, which is the subject of the novel since page 1, you will imply urgency.

Cybiote 3577 days ago

With enough labor, you could annotate enough sentences to cover a lot of dialogue cases. Sections like "'Stop!', he said angrily/dryly/mockingly" are probably fairly common. You'd be modeling the next most probable inflection given previous words and selected tones.

What would require understanding would be novel arrangements and metaphor to indicate emotional state. On-the-fly variations to avoid monotony might also be difficult, as well as sarcasm or combinations/levels (e.g. she spoke matter-of-factly but with mirth lightly woven through).

Houshalter 3578 days ago

And who says it can't understand the material? Recurrent networks have been trained that can translate between languages, or predict the next word in a sentence, with remarkable accuracy. Combined with WaveNet, this could be quite effective.

thomasahle 3577 days ago

There could be cases where the intonation depends on things entirely outside of the book. Say, if a politician in the text does something far from what we would expect them to do in today's world.

visarga 3577 days ago

How about we allow annotation of text with prosody cues? Mark the words you want stressed. We already use question and exclamation marks.
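
A minimal sketch of what such markup might look like, assuming a made-up convention where stressed words are wrapped in asterisks. The `to_ssml` helper and the asterisk syntax are hypothetical, chosen only for illustration; the output uses the `<speak>` and `<emphasis>` elements from the W3C SSML standard, which some speech synthesizers accept as input.

```python
import re
from xml.sax.saxutils import escape


def to_ssml(text: str) -> str:
    """Turn *starred* words into SSML <emphasis> elements.

    The asterisk convention is an assumption made for this sketch,
    not an established annotation scheme.
    """
    # re.split with one capture group alternates between plain text
    # (even indices) and the captured stressed span (odd indices).
    parts = re.split(r"\*([^*]+)\*", text)
    out = []
    for i, part in enumerate(parts):
        if i % 2 == 1:
            out.append(f'<emphasis level="strong">{escape(part)}</emphasis>')
        else:
            out.append(escape(part))
    return "<speak>" + "".join(out) + "</speak>"


if __name__ == "__main__":
    print(to_ssml("They went in the *shed*."))
```

The same sentence could then be marked differently for different readings, e.g. "*They* went in the shed." versus "They went in the *shed*."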

atty79 3577 days ago

I'd love that. Writing is a poor representation of language. It'd be nice to bring it up a notch. Here's a suggestion in a paper I wrote on better second language acquisition. https://www.researchgate.net/publication/261022308_BETTER_SE...

spiritus_ 3577 days ago

Like traditional audio books can capture perfectly what you're referring to...

ycombinatorMan 3577 days ago

They can, though?
