| HN Mirror



	by tmjdev 639 days ago
	While it is impressive and I like to follow the advancements in this field, it is incredibly frustrating to listen to. I can't put my finger on why exactly. It's definitely closer to human-sounding, but the uncanny valley is so deep here that I find myself thinking "I just want the point, not the fake personality that is coming with it". I can't make it through a 30s demo.

23 comments

swatcoder 639 days ago

We're used to hearing some kind of identity behind voices -- we unconsciously sense clusters of vocabulary, intonation patterns, tics, frequent interruption vs quiet patience, silence tolerance, response patterns to various triggers, etc that communicate a coherent person of some kind.

We may not know that a given speaker is a GenX Methodist from Wisconsin that grew up at skate parks in the suburbs, but we hear clusters of speech behavior that let our brain go "yeah, I'm used to things fitting together in this way sometimes"

These don't have that.

Instead, they seem to mostly smudge together behaviors that are just generally common in aggregate across the training data. The speakers all voice interrupting acknowledgements eagerly, they all use bright and enunciated podcaster tone, they all draw on similar word choice, etc -- they distinguish gender and each have a stable overall vocal tone, but no identity.

I don't doubt that this'll improve quickly though, by training specific "AI celebrity" voices narrowed to sound more coherent, natural, identifiable, and consistent. (And then, probably, leasing out those voices for $$$.)

As a tech demo for "render some vague sense of life behind this generated dialog" this is pretty good, though.

adamhartenz 639 days ago

To be fair, the majority of podcasts are from a group of generic white guys, and they almost sound identical to these AI generated ones. The AI actually seems to do a better job too.

freestyle24147 639 days ago

Citation absolutely needed. You call this fair?

> the majority of podcasts are from a group of generic white guys

sangnoir 639 days ago

https://podcastcharts.byspotify.com/ keep the Pareto distribution in mind

neom 639 days ago

I did the best fast research I could given not wanting to spend more than 20 minutes on it and came to this result (approx):
- Mixed/Diverse: 48.0%
- White Men: 35.0%
- Women: 8.0%
- Non-White: 6.0%
- White Woman: 2.0%
- Non-White Woman: 1.0%

tightbookkeeper 634 days ago

I love the "white = generic = bland" meme too. All the science and literature before 1930 it was trained on is also likely "generic white guys".

sourcepluck 638 days ago

I laughed and sort of agreed with this, in spirit. You're off on the details I think though.

When I listened to the audio samples before coming to the comments, I thought: "oh, like those totally lifeless and bland U.S. accents from podcasts, YT, etc."

I wouldn't associate it with skin colour or gender though at all. I've no idea why you'd go there - any skin colour and any gender is absolutely welcomed into the fold of U.S. cultural production, if they can produce bland generic "content" sincerely enough, it seems to me.

Disclaimer: many U.S. accents are interesting and wonderful (Colorado; Tom Waits), they don't all sound generic and bland. I have U.S. friends therefore I can pass judgment (TM).

TimTheTinker 639 days ago

Whether this stops at the uncanny valley or progresses to specific "AI celebrity" voices, I'm left thinking the engineers involved in this never stopped to think carefully about whether this ought to be done in the first place.

jsheard 639 days ago

"Surely my genAI product won't be used to spam zero-effort slop all over the internet!"

- guy whose genAI product will definitely be used to spam zero-effort slop all over the internet.

_DeadFred_ 639 days ago

I think their main target is corporate creative jobs. Background music to ads/videos/etc. And just like with all AI, they will eat the jobs that support the rest of the system, making it a one and done. It will give a one time boost, and then be stuck at that level because creatives won't have the jobs that allowed them to add to the domain. In this case new music styles. New techniques. It's literally eating the seed corn where the sprouts are the creatives working in the boring commercial jobs that allow them to practice/become experts in the tools/etc that they then build up it all. Their goal is to cut the jobs that create their training data and the ecosystem that builds up/expands the domain. Everywhere AI touches will basically be 'stuck using Cobol' because AI will be frozen at the point in time where the energy infusing 'sprouts' all had their jobs replaced by AI and without them creating new output for AI to train on it's all ossified.

We are witnessing in real time the answer to why 'The Matrix' was set when it was. Once AI takes over there is no future culture.

GuB-42 639 days ago

Assuming you are right and that we will miss a generation of creatives and AI keeps making crap, why can't the creative field regrow? AI won't remove creativity from human genes.

As people get fed up with AI generated crap, companies will start to pay very good money to the few remaining good human creatives in order to differentiate themselves. The field will then be seen as desirable, people will start working hard to get these jobs, companies will take apprentices hoping they will become masters later, etc... We may lose a generation, but certainly not the entire future.

Of course, it is just one of many possible futures, but I think the most likely if you take your assumptions as a postulate. It may turn out that AIs end up not displacing creative jobs too much, or going the other way, that AIs end up being truly creative, building their own culture together with humans, or not.

thanksgiving 638 days ago

It makes sense to me.

Step 0. Some People make novel art like a jingle that is unlike anything yet.

Step 1. Early use of said jingle creates a buzz and generates good sales results.

Step 2. It gets copied everywhere and by everyone. It is now a meme.

This is the step I think where generative AI can help. Slightly transform existing art to fit a particular purpose. This lets businesses save money by not paying humans to do this work.

Problem is we don't know where the next person or when this step 0 comes from... When we soak up all the "slack" and send all the "money" to the top because let's face it that's how it will work. The money "saved" from AI won't make goods and services cheaper by any significant measure. We will still have to pay as much as we can afford to pay.

grugagag 639 days ago

> It's literally eating the seed corn where the sprouts are the creatives working in the boring commercial jobs that allow them to practice/become experts in the tools/etc that they then build up it all.

This is a big problem that needs to be talked about more. The end goal of AI seems to be quite grim for jobs and generally for humans. Where will this pure profit lead to? If all advertising will be generated, who will want to have anything to do with all the products they’re advertising?

thanksgiving 638 days ago

Reminds me of that famous clip from Mad Men where Don suddenly realizes that if Lucky Strike can't say its cigarettes are safe, neither can its competitors, and comes up with "it's toasted".

In general, I have a feeling double digit growth forever is impossible. Facebook and Google both reported YoY growth of 15%+ this week iirc and I have a feeling they are only able to achieve this by destroying either competitors or adjacent industries rather than by "making the pie bigger". It will end at some point.

wiz21c 638 days ago

this is very spot on. There are tons of artists who have a job so they can sustain their own personal creativity.

pmontra 638 days ago

Almost everyone has a job to sustain their actual interests. Some of them happen to be musicians, writers, etc. Others play football, go fishing, talk to friends. There is nothing special in there. All of us will keep doing what we like to do even after AIs become the tool of mainstream creativity.

elpocko 638 days ago

A meme parrot is posting the millionth copy of the same comment complaining about "slop". Slightly ironic.

Fricken 638 days ago

It's the holy grail. When people can have naturalistic conversations with their computers they will love it more than other people. AI doesn't need to be useful so much as it needs to be loved. That's the secret to getting AI between people and everything they do in a day.

lancesells 639 days ago

Agreed. To me it sounds like bad voice-over actors reading from a script. So the natural parts of a conversation where you might say the wrong thing and step back to correct yourself are all gone. Impressive for sure.

xico 638 days ago

Yup. Plus the interactions you'd expect for instance in terms of matching style of voice in a normal discussion are missing. That being said it still sounds pretty impressive.

htrp 639 days ago

every step of technological advancement builds on top of the previous one.

now it's bad voice actors, in 2 years it'll be great ones

Cthulhu_ 638 days ago

This is why voice actors are on strike to stop their voices from being used with AI. I mean it's probably futile.

serf 638 days ago

it's probably futile, and the 'AI/art protests' seem to miss the point that the protest itself is also encouraging The Man to seriously consider AI-powered replacement.

The protest itself is exactly the kind of thing that will be avoided by replacing humans, demonstrated writ-large for the people with the cheque-book.

I can understand the spirit of protest and why it occurs, but it just seems so out-of-line strategically/tactically when used against automation that's taking jobs.

Just the order of events is kind of funny to me, and this applies to automation-job-taking protest the world over : A technique is demonstrated that displaces workers, the workers then picket and refuse to work -- understandable, but faced with the current prospect of "This mechanism performs similar work for cheaper", it seems counter-productive to then demonstrate the worst-case-scenario for the patron : a work stoppage that an automated workforce would never experience, alongside legal fees that would never be encountered had they an automated work-force.

That all said, protest is one of the only weapons in the arsenal of the working -- it just feels as if the argument against automation is one of the places where that technique rings hollow.

In the case of media/movies/literature/etc, I think the power to force corporations to value humans is solely in the hands of the consumer -- and unfortunately that's such an unorganized 'group' that it's unlikely they will establish any kind of collective action that would instantiate change.

beoberha 639 days ago

Totally agree. Maybe it’s just the clips they chose, but it feels overfit on the weird conversational elements that make it impressive? Like the “oh yeahs” from the other person when someone is speaking. It is cool to see that natural flow in a conversation generated by a model, but there’s waaaay too much of it in these examples to sound natural.

And I say all that completely slackjawed that this is possible.

echelon 639 days ago

I love the technology, but I really don't want AI to sound like this.

Imagine being stuck on a call with this.

> "Hey, so like, is there anything I can help you with today?"

> "Talk to a person."

> "Oh wow, right. (chuckle) You got it. Well, before I connect you, can you maybe tell me a little bit more about what problem you're having? For example, maybe it's something to do with..."

cmehdy 639 days ago

That's how the DJ feature of Spotify talks and it's pretty jarring.

"How's it going. We're gonna start by taking you back to your 2022 favorites, starting with the sweet sounds of XYZ". There's very little you can tweak about it, the suggestions kinda suck, but you're getting a fake friend to introduce them to you. Yay, I guess..

FuckButtons 639 days ago

Reminds me of the robots from the Sirius cybernetics corporation. “Your plastic pal who’s fun to be with.”

kelseyfrog 639 days ago

I'd love to see stats on disfluency rate in conversation, podcasts, and this sample to get an idea of where it lies. It seems like they could have cranked it up, but there's also the chance that it's just the frequency illusion because we were primed to pay attention to it.

amelius 639 days ago

> Like the “oh yeahs” from the other person when someone is speaking.

I bet that if you select a British accent you will get fewer of them.

bryanrasmussen 639 days ago

I'm hoping it will be a lot of Ok Guv'ner and right you ares in the style of Dick Van Dyke.

mindcrime 639 days ago

Gor blimey lad, that's the problem now innit???

KineticLensman 639 days ago

> a British accent

Hmm.... Scottish, Welsh, Irish (Nor'n) or English? If English, North or South? If North, which city? Brummie? Scouse? If South, London? Cockney or Multicultural London English [0]?

[0] https://en.wikipedia.org/wiki/Multicultural_London_English

beAbU 639 days ago

Need to increase your granularity a bit. I live in Wexford Town, Ireland, and the other day I was chatting to a person that told me their old schoolmates from Castlebridge are making fun of their accent changing since moving from their hometown.

Castlebridge is 10 minutes away by car. Madness!

KineticLensman 639 days ago

Yeah, totally agree. Here's a useful link for non-Brits, that goes into a bit more detail:

https://accentbiasbritain.org/accents-in-britain/

Also, we have yet to precisely define what is meant by 'British'. This probably needs a "20 falsehoods people believe about..."-type article.

shiroiushi 639 days ago

When people outside the British isles (esp. Americans) say "British accent", they almost invariably mean (British) English, and usually the "received pronunciation" accent that British media generally uses.

They do not mean Irish or Scottish accents; if they did, they would have said exactly that, because those accents are quite different from standard (British) English accents. So different, in fact, that even Americans can readily tell the difference, when they frequently have some trouble telling English and Australian accents apart.

Also, to most English speakers, "English accent" doesn't make much sense, because "English" is the language. It sounds like saying a German speaker, speaking German, has a "German accent". Saying "British accent" differentiates the language (English, spoken by people worldwide) from the accent (which refers to one part of one country that uses that language).

kelseyfrog 639 days ago

Right mate

Dilettante_ 639 days ago

Cheeky bugger, you are

lebuffon 639 days ago

ee by gum

hyperific 639 days ago

It's like their training set was made up entirely of awkward podcaster banter.

ukuina 639 days ago

At least 83% Leo Laporte.

jjw1414 638 days ago

If I turn the volume down to the point that I only hear the cadence/rhythm of the voices, but can no longer make out the words, it sounds like any, “This Week in…” podcast.

lokimedes 638 days ago

For me it isn’t uncanny from a lack of humanity. Rather, it triggers all my “fake and shallow” personality biases. It certainly sounds human enough, just not the type of humans I like.

xnx 639 days ago

Agreed. To be fair, I also get annoyed by fake/exaggerated expression from human podcasters.

onion2k 639 days ago

That could just be the context though. Listening to a clip that's a demo of what the model can produce is very different to listening to a YouTube video that's using the model to generate speech about something you'd actually want to watch a video of.

iNic 639 days ago

It sounds like every sentence is an ad read.

JoblessWonder 639 days ago

Yeah... It isn't that it doesn't sound like human speech... it just sounds like how humans speak when they are uncomfortable or reading prepared and they aren't good at it.

rob 639 days ago

Probably because you're expecting it and looking at a demo page. Put these voices behind a real video or advertisement and I would imagine most people wouldn't be able to tell that it's AI generated at all.

Veen 639 days ago

It'd be annoying to me whether it was AI or human. The faux-excitement and pseudo-bonhomie is grating. They should focus on how people actually talk, not on copying the vocal intonation of coked-up public radio presenters just back from a positive affirmation seminar.

semitones 639 days ago

I suppose it doesn't matter if it is a human, or a bot delivering the message, if the message is boring

MrSkelter 638 days ago

I agree. It’s profoundly sad that so much energy is being invested in solving the non-problem of making long documents accessible. To think that people will ignore carefully written work for the “chat show” output of an LLM is horrifying and a harbinger of a societal slide into happy stupidity and willing ignorance.

kaibee 639 days ago

> Example of a multi-speaker dialogue generated by NotebookLM Audio Overview, based on a few potato-related documents.

Listening to this on 1.75x speed is excellent. I think the generated speaking speed is slow for audio quality, bc it'd be much harder to slow-down the generated audio while retaining quality than vice versa.

moralestapia 639 days ago

It's due to the histrionic mental epidemic that we are going through.

A lot of people are just like that IRL.

They cannot just say "the food was fine", it's usually some crap like "What on earth! These are the best cheese sticks I've had IN MY EN TI R E LIFE!".

shermantanktop 639 days ago

“I’m OBSESSED with the dipping sauce. So good.”

Cthulhu_ 638 days ago

I tuned it out instantly because I have that feeling with most Americans / podcasts / etc already. That said, it's a convincing enough analog for that kind of content I think.

pmontra 638 days ago

It doesn't feel any different to me than listening to a random radio station where I don't know who is speaking. I didn't feel any uncanny valley but I'm not an English native speaker so I might miss some nuances. However there are relatively few English native speakers around the world so this might not be a problem for us.

The problem is that people talking over each other is not a format I long to listen to.

narag 639 days ago

> While it is impressive and I like to follow the advancements in this field...

Please don't think that I'm trying to suggest... anything. It's just that I'm getting used to reading this pattern in the output of LLMs. "While this and that is great...". Maybe we're mimicking them now? I catch myself using these disclaimers even in spoken language.

tmjdev 639 days ago

I like to preface negativity with a positive note. Maybe I am influenced in my word choice but my intent was to point out that this is a very, very impressive feat and I don't want to undermine it.

nl 638 days ago

Whilst I don't doubt you feel like that, the general response to the NotebookLM podcast feature (which uses this) has been very positive.

In general people find the back and forth between the "hosts" engaging, and it also gives people time to digest the contents.

ljf 638 days ago

When I got to the bit where they referred to the smaller training set of paid voice actors, that hit it for me. It certainly sounds like they are throwing the 'um's and 'ah's into a script - not naturally.

This is good, but certainly not yet great.

jeksicjjdjisos 638 days ago

There’s a certain fakeness to the rhythm of the space between words. Particularly the “uh” and “um” filler sounds. To me it sounds like they always either come in abnormally early or late after speaking those sounds

pvarangot 639 days ago

It's because it's probably trained with "professional audio", ads, movies, audiobooks, and not "normal people talking". Like the effect when diffusion was mostly trained with stock photos.

yapyap 639 days ago

they all sound like valley-people, complete with the raspy voice and everything

chrismorgan 638 days ago

> Audio clip of two speakers telling a funny story, with laughter at the punchline.

In a similar vein, I’m glad they told me it was a funny story, because otherwise I wouldn’t have known.

vel0city 638 days ago

I got a similar feeling. I think it was overdoing the ums and uhhs for something trying to sound like an even slightly professional podcast kind of sound.

gwbas1c 639 days ago

I get the feeling that this is useful for something that someone half-listens to.