by grandalf 3574 days ago
This is incredible. I'd be worried if I were a professional audiobook reader :)
4 comments

I worked for Audible for five years, and this exact conversation was had often in my division (ACX.com - Audible's "Audiobook Creation Exchange".)

Audible put ACX together in order to bolster its catalog. The company-wide initiative was called PTTM ('pedal to the metal') and ACX was Audible's secret weapon to gain an enormous competitive advantage over the rest of the audiobook industry. Because we paid amateurs dirt-cheap rates to record horrible, self-published crap (to which Amazon, Audible's parent company, had the exclusive rights), Audible was able to bolster its numbers substantially in a short period of time.

The dirty not-so-secret behind this strategy was: nobody bought these particular audiobooks. These audio titles were not really made to be "purchased," but rather to bulk up Audible's bottom line. We knew that the ACX titles were not popular, because the amateur narrators' acting talents and audio production skills were remarkably subpar.

Neural nets may be able to narrow the gap between the pros and the lowest-common-denominator to the point where they can become the next "ACX," but frankly, it won't matter to audiobook listeners, because audiobook listeners don't buy "ACX" audiobooks. Books, even in audio form, are a major intellectual and temporal commitment (not to mention -- they tend to be pricey.) Customers will always want to buy the human-narrated version of a book - the professional production of a book. If that stops being offered, Audible will anger a lot of customers and I think Bezos has better shit to worry about than his puny audiobooks subsidiary.

Despite that, user-generated content is a secret weapon that a lot of websites wield effectively - including HN - but this is beginning to shed its effectiveness. Indeed, the next generation of cost-slashing-while-polluting-the-quality-of-your-catalog will belong to the neural nets. They may be able to get better sales than ACX titles do today with AI-generated audio content, but the actors are going nowhere.

I've listened to some LibriVox recordings of public domain works, notably A Princess of Mars. The price was right at the time, though the quality was, as you say, remarkably subpar. If I could have had a neural net read me the book instead of having to adjust to a new narrator every chapter, that would have been preferable.

That said, I have money now, so give me Todd McLaren narrating Altered Carbon for the cost of an Audible Credit every time.

I wouldn't. The results they offer are excellent, but the missing points they need to achieve human level are related to producing the correct intonation, which requires accurate understanding of the material. That is still at least ten years in the future, I expect.
Not really. They're training directly on the waveform, so the model can learn intonation. They just need to train on longer samples, and perhaps augment their linguistic representation with some extra discourse analysis.

A big problem with generating prosody has always been that our theories of it don't really provide a great prediction of people's behaviours. It's also very expensive to get people to do the prosody annotations accurately, using whatever given theory.

Predicting the raw audio directly cuts out this problem. The "theory" of prosody can be left latent, rather than specified explicitly.

There's zero chance of effective intonation and tone without an understanding of the material.
I think your use of the term "understanding" is very unhelpful here. It's better to think about what you need to condition on to predict correctly.

In fact, most intonation decisions are pretty local, within a sentence or two. The most important things are given/new contrasts, i.e. the information structure. This is largely determined by the syntax, which we're doing pretty well at predicting, and which latent representations in a neural network can be expected to capture adequately.

The same sentence can have a very nonlocal difference in intonation.

Say, “They went in the shed”. You won't pronounce it in a neutral voice if it was explained in the previous chapter that a serial killer is in it.

On the other hand, if the shed contains a shovel that is quickly needed to dig out a treasure, which is the subject of the novel since page 1, you will imply urgency.

With enough labor, you could annotate enough sentences to cover a lot of dialogue cases. Sections like "'stop!', he said angrily/dryly/mockingly" are probably fairly common. You'd be modeling the next most probable inflection given previous words and selected tones.
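As a rough illustration of how such dialogue tags could be harvested as tone labels, here is a minimal sketch. The regex, the `tone_cues` name, and the restriction to "he/she/they said <adverb>" are all assumptions made for the example; it would miss quoted lines containing apostrophes and most real-world phrasing.

```python
import re

# Hypothetical pattern: a quoted line followed by "he/she/they said <adverb ending in -ly>".
DIALOGUE_TAG = re.compile(
    r"""['"](?P<line>[^'"]+)['"],?\s+(?:he|she|they)\s+said\s+(?P<tone>\w+ly)\b"""
)

def tone_cues(text):
    """Return (spoken line, tone adverb) pairs found in the text."""
    return [(m["line"], m["tone"]) for m in DIALOGUE_TAG.finditer(text)]

print(tone_cues("'Stop!', he said angrily. \"Fine,\" she said dryly."))
```

Pairs like these could serve as cheap training labels before paying anyone to annotate by hand.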

What would require understanding would be novel arrangements and metaphor to indicate emotional state. On-the-fly variations to avoid monotony might also be difficult, as well as sarcasm or combinations/levels (e.g. she spoke matter-of-factly but with mirth lightly woven through).

And who says it can't understand the material? There have been recurrent networks trained that can translate between languages, or predict the next word in a sentence, at remarkable accuracy. Combined with wavenet this could be quite effective.
There could be cases where the intonation is dependent on things entirely outside of the book, for example if a politician in the story does something far from what we would expect them to do in today's world.
How about we allow annotation of text with prosody cues? Mark the words you want stressed. We already use question and exclamation marks.
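One way this could look in practice, as a sketch only: mark stressed words with asterisks and convert them to SSML `<emphasis>` elements, which many speech synthesizers accept. The `*word*` convention and the `to_ssml` helper are hypothetical, chosen just for this example.

```python
import re
from xml.sax.saxutils import escape

def to_ssml(text):
    """Wrap *starred* words in SSML emphasis tags (the star markup is an assumed convention)."""
    escaped = escape(text)  # protect &, <, > before adding our own tags
    marked = re.sub(
        r"\*([^*]+)\*",
        r'<emphasis level="strong">\1</emphasis>',
        escaped,
    )
    return f"<speak>{marked}</speak>"

print(to_ssml("They went in the *shed*."))
```

Question and exclamation marks pass through untouched, so the synthesizer can still use them as cues.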
Like traditional audio books can capture perfectly what you're referring to...
I don't see why many aspects of intonation couldn't be taught the same way ...
I think the point is that different parts of the story need different intonation patterns (reading a scary part vs a boring part, etc.).

So in theory, it could be achieved by having multiple training sets (for the different intonation styles), along with analysis of the text to direct which part of the text needs what intonation. You might even be able to blend intonations.

Or just pay MTurk workers to annotate texts with intonation cues.

I kinda doubt that would be profitable relative to just hiring readers, but in general you don't need to replace workers completely to cannibalize some of their wages/jobs.

Or treat it as part of the original author's job. When you write a piece of music you add tempo and intensity metadata to the score, so why not do the same when writing a novel?
Or the author could just add that information to the text. This way there's no need to "understand" it.
There is significant advance in sentiment analysis too. Trading bots use sentiment analysis as some of the input for their time series prediction algorithms. I would not say 10 years.
What about auto-tuning? I can do a pretty good reading-with-intention but I don't have the melt-your-brain-rich tones of Stephen Fry or Ian McKellen.
That is so exciting for me. I love listening to audiobooks when I'm walking my dog, or driving, or something boring that doesn't need my brain but does need my arms.

The issue is the selection is so much smaller than the selection of books.

Indeed. It also sounds like it could be trained to correctly read math or code, the two things that require enough expertise to pronounce properly that most text-to-speech engines fail miserably at them.

Something like:

  a(b+c)
"a times the quantity b plus c"

If read with proper inflection, this would be a vast improvement and could open up all sorts of technical material to people for whom audio learning is preferred.
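The wording half of that (not the inflection) can be sketched with a small rule-based reader. This is only an illustration under stated assumptions: it uses Python's `ast` module to parse the expression, so the implicit multiplication in a(b+c) has to be written out as `a*(b+c)`, and the function names and the "the quantity" rule are my own choices.

```python
import ast

# Spoken word and precedence rank for each operator this sketch supports.
OPERATORS = {
    ast.Add: ("plus", 1),
    ast.Sub: ("minus", 1),
    ast.Mult: ("times", 2),
    ast.Div: ("divided by", 2),
}

def _speak(node, parent_rank=0, is_right=False):
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Constant):
        return str(node.value)
    if isinstance(node, ast.BinOp) and type(node.op) in OPERATORS:
        word, rank = OPERATORS[type(node.op)]
        left = _speak(node.left, rank, False)
        right = _speak(node.right, rank, True)
        phrase = f"{left} {word} {right}"
        # Say "the quantity" exactly where written math would need parentheses.
        grouped = rank < parent_rank or (is_right and rank == parent_rank)
        return f"the quantity {phrase}" if grouped else phrase
    raise ValueError(f"unsupported expression: {ast.dump(node)}")

def speak_expression(source):
    """Return an English reading of a simple arithmetic expression."""
    return _speak(ast.parse(source, mode="eval").body)

print(speak_expression("a*(b+c)"))  # a times the quantity b plus c
```

A learned model would still have to supply the pauses and pitch that make such a reading unambiguous to the ear.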

I think back to the first math teacher I had whose pronunciation of the notation was precise and unambiguous enough that one didn't really have to be watching the board. This is a rare gift: it is possible in many areas of math, yet few teachers master it (or realize how helpful it is).

I'm an audiobook junkie and as far as professional narrators go, I think it'd be hard to replace a high-end performance with something computer generated and end up with the level of quality offered by the likes of a great narrator like Scott Brick. I mention him by name because it was him that made me realize how important good quality narration is. I had purchased a book at an airport bookstore on a whim and while waiting for a plane was so disgusted with the poor quality of the writing that I actually threw the book out[0]. Years later, I had grabbed an audio book by an author I hadn't heard of simply because it was read by Scott Brick and recommended to "Read Next". Two hours in and I realized the book I had been enjoying so much was the same terrible book I had thrown out years before[1].

While I don't doubt it'll be possible for a computer to match it with enough input data (both in voice and human adjustment), it'll probably be a while before we'll be there and when we are there it'll likely require a lot of adjustment on the part of a professional. A big part of narration is knowing when and where a part of the story requires additional voice acting (and understanding what is required). A machine generated narration would have to understand the story sufficiently to be able to do that correctly. They might be able to get the audio to sound as good as it would sound if I narrated it, but someone with talent in the area is going to be hard to match.

All of that aside, it's getting pretty close to "good enough". When it reaches that point, my hope is more books will have audio versions available[2], and in all likelihood some books that would have been narrated by a person today will be narrated by technology instead, limiting human narration to the top x% of books.

[0] I always resell books or donate them. This book was so bad that the half-hour it took from my life felt like a tragedy. I threw it out to prevent someone from experiencing its awfulness -- even for free.

[1] I realized it was the same book at the point a story was told that I had only read in the first book (and found mildly humorous). The reason I hated the other book was that it was written in the first person as a New York cop. I couldn't form a mental picture and the character was entirely unbelievable and one dimensional. When narrated properly, that problem was eliminated.

[2] I "speed read" (not gimmicky ... scan/skimming) and consume a ton of text. I've been doing it for 20 years or so and find it difficult to read word-for-word as is required for enjoyment of fiction, so to "force" it, I stick with audio books for fiction and love them.

I too greatly appreciate highly skilled readers. It's another layer of creativity and inspiration in addition to the text, and when done well adds a lot to the book.
I only fell in love with the voice of a single audiobook narrator. I checked, and yes, he was Scott Brick. I think he adds about 50% on top of the value of the written book by his interpretation.
He's incredible - some people complain that he's a little bit of a slow reader, but audio book apps usually have a speed option. He enunciates well and adds a depth of feeling to the work that can take a book that's average up several notches.

He's also the only narrator that I can name[0].

[0] Who isn't well known for other things -- Douglas Adams narrated his entire series, and some actors are also regular audio book narrators, but Scott Brick is purely a narrator (or at least, was when I last looked).