
by ttul 601 days ago

The more I listen to NotebookLM “episodes”, the more I am convinced that Google has trained a two-speaker “podcast discussion” model that directly generates the podcast off the back of an existing multimodal backbone. The two speakers interrupt and speak over each other in an uncannily humanlike manner. I wonder whether they basically fine-tuned against a huge library of actual podcasts along with the podcast transcripts, and perhaps generated synthetic “input material” from the transcripts to feed in as training samples.

In other words, take an episode of The Daily and have one language model write a hypothetical article that would summarize what the podcast was about. And then pass that article into the two-speaker model, transcribe the output, and see how well that transcript aligns with the article fed in as input.

I am sure I’m missing essential details, but the natural sound of these podcasts cannot possibly be coming from a text transcript.
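
To make the round trip above concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: `write_article`, `two_speaker_model`, and `transcribe` are hypothetical callables standing in for real models, and the word-overlap score is only a crude placeholder for whatever alignment measure would actually be used. Nothing here reflects how NotebookLM was trained.

```python
import re
from typing import Callable


def tokens(text: str) -> set:
    """Lowercased word set; a crude stand-in for a real semantic comparison."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def alignment_score(article: str, transcript: str) -> float:
    """Jaccard overlap between the vocabularies of the two texts (0.0 to 1.0)."""
    a, t = tokens(article), tokens(transcript)
    if not a and not t:
        return 1.0  # two empty texts are trivially aligned
    return len(a & t) / len(a | t)


def round_trip_score(
    podcast_transcript: str,
    write_article: Callable[[str], str],      # hypothetical: transcript -> article
    two_speaker_model: Callable[[str], str],  # hypothetical: article -> "audio"
    transcribe: Callable[[str], str],         # hypothetical: "audio" -> transcript
) -> float:
    """Run article -> podcast -> transcript and score it against the article."""
    article = write_article(podcast_transcript)
    audio = two_speaker_model(article)
    regenerated = transcribe(audio)
    return alignment_score(article, regenerated)
```

With identity functions plugged in for all three callables, the score is 1.0; a real pipeline would use it as a signal for how faithfully the generated episode covers its input.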

3 comments

famouswaffles 601 days ago

Following up on swyx, the TTS is probably Google finally releasing Soundstorm from the basement.

https://google-research.github.io/seanet/soundstorm/examples...


swyx 601 days ago

> the more I am convinced that Google has trained a two-speaker “podcast discussion” model that directly generates the podcast off the back of an existing multimodal backbone.

I have good and bad news for you - they did not! We were the first podcast to interview the audio engineer who led the audio model:

https://www.latent.space/p/notebooklm

TLDR they did confirm that the transcript and the audio are generated separately, but yes the TTS model is trained far beyond anything we have in OSS or commercially available


famouswaffles 601 days ago

Soundstorm is probably the TTS https://google-research.github.io/seanet/soundstorm/examples...


swyx 601 days ago

they didn't confirm or deny this in the episode - all i can say is there are about 1-2 yrs of additional research that went into nblm's tts. soundstorm is more of an efficiency paper imo


refulgentis 601 days ago

Really good catch. Ty.


ttul 601 days ago

Thank you swyx. How did I miss this episode?


swyx 599 days ago

did you LIKE and SUBSCRIBE?? :)


rmorey 601 days ago

I feel similarly about NotebookLM, but have noticed one odd thing - occasionally Host A will be speaking, and suddenly Host B will complete their sentence. And usually when this happens, it's in a way that doesn't make sense, because Host A was just explaining something to Host B or answering Host B's question.

I'm actually not sure what to make of that, but it's interesting to note


dleeftink 601 days ago

It's speaker diarisation: the quality of the resulting labelling and of the speaker end-marker tokens is what influences the rhythm of a conversation. (Or the input data just has many podcast hosts completing each other's... sandwiches?)


behnamoh 601 days ago

That's the annoying part about NLM. It ruins the illusion of having one person explaining it to the other person.


albert_e 601 days ago

I think this is an important tell: it betrays that there are not two minds here creating 1+1=3.

One cheap trick to overcome this uncanny valley may be to actually use two separate LLMs or two separate contexts / channels to generate the conversations and take "turns" to generate the followup responses and even interruptions if warranted.

Might mimic a human conversation more closely.
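
A rough Python sketch of that turn-taking idea, under stated assumptions: `host_a` and `host_b` are hypothetical callables, each wrapping its own model or context, and each one only sees the shared transcript so far and returns its next utterance. Interruptions are not modelled here; this only shows the alternating loop.

```python
from typing import Callable, List, Tuple

Turn = Tuple[str, str]  # (speaker name, utterance)
Host = Callable[[List[Turn]], str]  # hypothetical: transcript so far -> next line


def converse(host_a: Host, host_b: Host, opening: str, n_turns: int) -> List[Turn]:
    """Alternate between two independent generators, starting with B's reply."""
    history: List[Turn] = [("A", opening)]  # Host A speaks the opening line
    speakers = [("B", host_b), ("A", host_a)]
    for i in range(n_turns):
        name, generate = speakers[i % 2]
        # Pass a copy so a host cannot mutate the shared transcript.
        history.append((name, generate(list(history))))
    return history


if __name__ == "__main__":
    # Stub hosts standing in for two separate LLMs.
    a = lambda h: f"(A responds to turn {len(h)})"
    b = lambda h: f"(B responds to turn {len(h)})"
    for speaker, line in converse(a, b, "Welcome to the show.", 4):
        print(f"{speaker}: {line}")
```

Because each host is a separate callable, one could swap in different models, prompts, or private context per speaker without changing the loop.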


thomashop 601 days ago

Funnily, even two different LLMs, when put in conversation with each other, can end up completing each other's sentences. I guess it has something to do with the sequence prediction training objective.


newsbinator 601 days ago

And this regularly happens with humans too


benmo_atx 601 days ago

Those moments always make me think they’re going for a scripted conversation style where the “learner” is picking up the thread too quickly and interjecting their epiphany inline for the benefit of the listener.
