AudioPaLM: A Large Language Model That Can Speak and Listen | HN Mirror


	AudioPaLM: A Large Language Model That Can Speak and Listen (google-research.github.io)
	119 points by ml_basics 1092 days ago

8 comments

ml_basics 1092 days ago

> We introduce AudioPaLM, a large language model for speech understanding and generation. AudioPaLM fuses text-based and speech-based language models, PaLM-2 [Anil et al., 2023] and AudioLM [Borsos et al., 2022], into a unified multimodal architecture that can process and generate text and speech with applications including speech recognition and speech-to-speech translation. AudioPaLM inherits the capability to preserve paralinguistic information such as speaker identity and intonation from AudioLM and the linguistic knowledge present only in text large language models such as PaLM-2.

Direct link to demo video showing speech-to-speech translation: https://google-research.github.io/seanet/audiopalm/examples/... (see website for more examples)

ot 1092 days ago

Impressive that it translated "Morgenstund hat Gold im Mund" (morning hour has gold in the mouth) to the equivalent English expression "the early bird gets the worm", instead of going for a literal translation.

I wonder though how much the text in the video was editorialized. For example, I doubt that the model would have correctly capitalized PaLM.

famouswaffles 1092 days ago

Bilingual LLMs make less literal translations where appropriate.

https://arxiv.org/abs/2305.16806

ksaj 1091 days ago

And this makes sense, as some sayings would be unrecognizable word salad in other languages and cultures. Early Arctic birds don't catch many worms.

Gold in the mouth is something popularized by rappers (grills) back in the 90's, so that doesn't translate well at all for me.

bamboozled 1092 days ago

I actually really liked the literal translation, I thought it was cool even though I'd never heard it before, "oh well"...

ksaj 1091 days ago

Since it is an LLM, you should be able to ask it for a literal translation if that's what you want.

criddell 1092 days ago

For some reason I’ve been getting 12-20 spam calls per day (all for the same Medicaid/Medicare scam). I’m on T-Mobile, which was one of the first carriers to roll out STIR/SHAKEN, and I have their Scam Buster app installed, and the calls are getting by all of that. It’s frustrating.

When I read about things like AudioPaLM, my first thought is of all the people in these call centers who seem to uniformly have pretty hard Indian accents and very American-sounding names (George Bush called me the other day!). Their days of working in a call center are numbered and their replacement is going to be a machine that is way cheaper to employ and better at the job.

pessimizer 1092 days ago

I'm more worried about the Philippines. Call center work is supporting the lower (younger) end of an educated bilingual middle-class there just as in India, but India has had more time to develop more options for those people than the Philippines has had.

ChatGTP 1089 days ago

The goal for business is to basically replace everyone with a computer, so I'd be worried about, everyone :)

But actually, what is interesting to think about is that the desire to learn English will likely start to diminish from this. If there is little gain to learning it, like, the computer will just take your job, would you still bother?

I mean some will remain interested, but many won't.

Jeff_Brown 1089 days ago

I live in Colombia and can testify that the demand to know English is enormous, and dwarfs the subset of demand for which call center work is responsible. Programming languages are documented in English. Scientific papers are written in English. Jobs in America require English. International travel outside of Latin America is much easier if you speak English. Etc.

zoklet-enjoyer 1092 days ago

I get those all day every day. Fun to mess with them sometimes. I usually tell them my name is Ben Chode and my birthday is April 20, 1969.

criddell 1092 days ago

I just want there to be consequences for the abuse of the phone system and harassment that results. Nobody cares though.

The phone company will change your number if you want. The FCC will let you report these - one call at a time.

I actually thought about making an app to let me submit a report with a single click. If I started submitting 40-80 reports a week, would that get anybody’s attention? Would somebody at the FCC contact T-Mobile on my behalf and ask them to actually help me with this? Probably not.

zoklet-enjoyer 1092 days ago

Agreed. It basically makes my phone useless as a phone. I generally don't answer unrecognized calls unless I'm up for messing with a scammer, because that's who is usually calling. When I am expecting a phone call that's a problem because I sometimes don't recognize the number so don't answer. And my voicemail inbox is often just filled with this garbage.

mdaniel 1092 days ago

Relevant: https://lennytroll.com/about.php (example: https://www.youtube.com/watch?v=frlde-PUrPA )

robterrell 1092 days ago

Earlier this week I got a spam call that was almost certainly an AI-generated human voice.

rhogar 1092 days ago

Though the 8B model is almost certainly not capable of near-real-time inference yet, we’re approaching Babel fish territory. The main difference is perhaps that this is powered by burning massive amounts of carbon rather than by a fish brain.

gwern 1092 days ago

> Though the 8B model is almost certainly not capable of near-real-time inference yet

Google previously showed you could get the full-sized 540B-parameter PaLM-1 model down to "a low-batch-size latency of 29ms per token during generation (with int8 weight quantization)" https://arxiv.org/abs/2211.05102#google . How many tokens per 1000ms do humans speak? I'm guessing fewer than 34. The real question is who wants to pay for it.
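
The arithmetic behind "fewer than 34" can be checked with a rough sketch. The 29 ms figure comes from the quoted result; the speaking rate (150 words per minute) and the tokens-per-word ratio (1.3) are illustrative assumptions, not numbers from the thread or the paper. The sketch only covers text tokens and does not model the audio tokens AudioPaLM also generates.

```python
# Back-of-the-envelope comparison of model generation speed vs. human speech.
# MS_PER_TOKEN is the latency quoted in the comment above.
MS_PER_TOKEN = 29.0

# ASSUMPTIONS (hypothetical, for illustration only):
ASSUMED_WORDS_PER_MINUTE = 150.0   # rough conversational speaking rate
ASSUMED_TOKENS_PER_WORD = 1.3      # rough text-tokenizer ratio for English


def tokens_per_second(ms_per_token: float) -> float:
    """Convert a per-token latency in milliseconds to a tokens-per-second rate."""
    return 1000.0 / ms_per_token


def speech_tokens_per_second(words_per_minute: float, tokens_per_word: float) -> float:
    """Estimate how many text tokens per second a speaker produces."""
    return words_per_minute / 60.0 * tokens_per_word


model_rate = tokens_per_second(MS_PER_TOKEN)
human_rate = speech_tokens_per_second(ASSUMED_WORDS_PER_MINUTE, ASSUMED_TOKENS_PER_WORD)

print(f"model: {model_rate:.1f} tokens/s")   # model: 34.5 tokens/s
print(f"human: {human_rate:.2f} tokens/s")   # human: 3.25 tokens/s
```

Under these assumptions the model would generate text tokens roughly ten times faster than a person speaks them, which is consistent with the guess above.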

Kinrany 1092 days ago

I wonder if it can translate from English into English Spoken By Five Year Old

zb3 1092 days ago

Hey Google, what about finally giving me access to MusicLM?

famouswaffles 1092 days ago

You can use MusicLM on Google's AI Test Kitchen.

zb3 1092 days ago

I can't, as I was not granted access to that (I filled out the form).

villgax 1091 days ago

What a joke, 8 billion parameters to gain 1 percent compared to the 1.5B of the largest Whisper model.

ChatGTP 1092 days ago

I can't wait till everyone is using this and we have absolutely zero idea whether or not it's actually translating things correctly or using its own interpretations of things, going to be...awesomeeeeeeeee!

famouswaffles 1092 days ago

SOTA bi/multilingual LLMs with good enough representation of the languages (takes much less data than you'd think) are human-level translators. Hallucinations on tasks like summaries, translations, etc. are near non-existent.

ChatGTP 1092 days ago

Thank you for reminding me that it's going to be, awesommeeee.

Curious, do you speak more than one language?

Edit: I just had a look at your comment history, do you realize you're like, incredibly pro-LLM? Do you just scour HN looking for LLM articles and comment on them in a positive way? Not having a poke, it's just interesting how keen you are.

famouswaffles 1092 days ago

Yes. And although it's not a language I'm familiar with, I also tested GPT and GLM-130B on Mandarin.

hfhdjdks 1092 days ago

Are you American by any chance?

Over here people speak multiple languages. I doubt we'll run out of people that speak multiple languages just because there's a language model that can do great translations.

famouswaffles 1091 days ago

Your comment history is fairly LLM skeptic. I'm not sure what that has to do with anything. The only difference in this instance is that I've actually tested GPT-4 on translations while you haven't.

If you're going to rag on a product's capabilities on x, you'd think the least you could do is use it for x first.

ChatGTP 1091 days ago

How on earth do you know what people have or haven’t done?

Are you spying on everyone?

famouswaffles 1091 days ago

It's obvious you haven't lol. Your comment reads like someone who hasn't, and you never bothered saying you had but just didn't agree on the issue of quality. Even now, your defense isn't "but I have", it's "how do you know I haven't?", a telltale sign of someone who actually hasn't bothered.

blovescoffee 1092 days ago

I share the same sentiment as the original commenter and I speak more than one language. Why do you ask?

ChatGTP 1092 days ago

Because virtually everyone tests these things with two languages they're familiar with, else you couldn't really verify if it was correct or not. For languages you're not familiar with, you don't have the "mental model" to talk using a translator. That is, there is more to this than just "talking": there are cultural norms, local dialects, slang, etc. which are to be respected when learning and speaking languages with native speakers. When a person who speaks English and Italian tests these things, they know what they're in for and compensate a bit.

Google Translate screws up for me really, really hard sometimes when I'm speaking Korean, but I'm already a pretty strong speaker, native, so I know how to work with the screw-ups...and laugh about the really bad ones. I'm not going to go into a meeting and blast off with an auto-translator without understanding what I'm saying or having someone make sure I'm saying the right thing by talking with them first.

I personally wouldn't feel comfortable using something like this for anything of real significance; a really good translator can ensure the message gets delivered.

seanthemon 1092 days ago

Do you use Google Translate for anything of significance?