by ripperdoc 1149 days ago
Am I hallucinating or didn't several of the examples have background audio artifacts, like it's been trained on speech with noisy backgrounds, I'm guessing audio from movies paired with subtitles? Having random background audio can make it quite hard to use in production.
2 comments

>Am I hallucinating or didn't several of the examples have background audio artifacts, like it's been trained on speech with noisy backgrounds, I'm guessing audio from movies paired with subtitles? Having random background audio can make it quite hard to use in production.

The other side of that problem is an opportunity. The same model can also generate music, background noise and sound effects, simply because the prompt specifies those things explicitly. The input is truly semantic, so the output is rich and reflects that context. If your input text sounds like it came from a speech, there's a high chance your output audio will sound like a megaphone in a public space with crowd reactions and maybe even applause.

I hear it too. I don't know if it's just background noise though. It may be quality issues with the audio synthesis.
Yeah, sometimes there are definitely artifacts. Technically they can be removed pretty easily with another model (like the denoiser from FB), but for now we wanted to keep it simple and learn to control these things better through prompt engineering. For example, when given a high-quality input prompt, it generally continues with high quality.
At least in the last example, with the man and woman and the expensive oat milk, the background noise seemed to fit a likely public conversation scenario. I wasn't sure if it was accidental or not.