HN Mirror


	by ripperdoc 1196 days ago
	Am I hallucinating or didn't several of the examples have background audio artifacts, like it's been trained on speech with noisy backgrounds, I'm guessing audio from movies paired with subtitles? Having random background audio can make it quite hard to use in production.

JonathanFly 1196 days ago

>Am I hallucinating or didn't several of the examples have background audio artifacts, like it's been trained on speech with noisy backgrounds, I'm guessing audio from movies paired with subtitles? Having random background audio can make it quite hard to use in production.

The other side of that problem is an opportunity. That's why the same model can also generate music, background noise and sound effects. And it's just because the prompt specifies those things explicitly. The input is truly semantic, so the output is rich and reflects that context. If your input text sounds like it came from a speech, then there's a high chance your output audio will sound like a megaphone in a public space with crowd reactions and maybe even applause.

CreepGin 1196 days ago

I hear it too. I don't know if it's just background noise though. It may be quality issues with the audio synthesis.

gkucsko 1196 days ago

yeah sometimes there are definitely artifacts. technically they can be removed pretty easily with another model (like denoiser from FB) but for now we wanted to keep it simple and learn to control these things better through prompt engineering. Like when using a high-quality input prompt, it generally continues with high quality.

meepmorp 1196 days ago

At least in the last example, with the man and woman and the expensive oat milk, the background noise seemed to fit a likely public conversation scenario. I wasn't sure if it was accidental or not.
