HN Mirror


by swyx 596 days ago

> the more I am convinced that Google has trained a two-speaker “podcast discussion” model that directly generates the podcast off the back of an existing multimodal backbone.

I have good and bad news for you - they did not! We were the first podcast to interview the audio engineer who led the audio model:

https://www.latent.space/p/notebooklm

TLDR: they did confirm that the transcript and the audio are generated separately, but yes, the TTS model is trained far beyond anything we have in OSS or commercially available.


famouswaffles 596 days ago

Soundstorm is probably the TTS https://google-research.github.io/seanet/soundstorm/examples...


swyx 596 days ago

they didn't confirm or deny this in the episode - all i can say is there are about 1-2 yrs of additional research that went into nblm's tts. soundstorm is more of an efficiency paper imo


refulgentis 596 days ago

Really good catch. Ty.


ttul 596 days ago

Thank you swyx. How did I miss this episode?


swyx 595 days ago

did you LIKE and SUBSCRIBE?? :)
