> We introduce AudioPaLM, a large language model for speech understanding and generation. AudioPaLM fuses text-based and speech-based language models, PaLM-2 [Anil et al., 2023] and AudioLM [Borsos et al., 2022], into a unified multimodal architecture that can process and generate text and speech with applications including speech recognition and speech-to-speech translation. AudioPaLM inherits the capability to preserve paralinguistic information such as speaker identity and intonation from AudioLM and the linguistic knowledge present only in text large language models such as PaLM-2.
Impressive that it translated "Morgenstund hat Gold im Mund" (morning hour has gold in the mouth) to the equivalent English expression "the early bird gets the worm", instead of going for a literal translation.
I wonder though how much the text in the video was editorialized. For example, I doubt that the model would have correctly capitalized PaLM.
For some reason I’ve been getting 12-20 spam calls per day (all for the same Medicaid/Medicare scam). I’m on T-Mobile, which was one of the first carriers to roll out STIR/SHAKEN, and I have their Scam Buster app installed, yet the calls are getting past all of that. It’s frustrating.
When I read about things like AudioPaLM, my first thought is of all the people in these call centers who seem to uniformly have pretty hard Indian accents and very American-sounding names (George Bush called me the other day!). Their days of working in a call center are numbered and their replacement is going to be a machine that is way cheaper to employ and better at the job.
I'm more worried about the Philippines. Call center work is supporting the lower (younger) end of an educated bilingual middle-class there just as in India, but India has had more time to develop more options for those people than the Philippines has had.
The goal for business is to basically replace everyone with a computer, so I'd be worried about everyone :)
But actually, what is interesting to think about is that the desire to learn English will likely start to diminish from this. If there is little to gain from learning it because the computer will just take your job, would you still bother?
I mean some will remain interested, but many won't.
I live in Colombia and can testify that the demand to know English is enormous, and dwarfs the subset of demand for which call center work is responsible. Programming languages are documented in English. Scientific papers are written in English. Jobs in America require English. International travel outside of Latin America is much easier if you speak English. Etc.
I just want there to be consequences for the abuse of the phone system and harassment that results. Nobody cares though.
The phone company will change your number if you want. The FCC will let you report these - one call at a time.
I actually thought about making an app to let me submit a report with a single click. If I started submitting 40-80 reports a week, would that get anybody’s attention? Would somebody at the FCC contact T-Mobile on my behalf and ask them to actually help me with this? Probably not.
Agreed. It basically makes my phone useless as a phone. I generally don't answer unrecognized calls unless I'm up for messing with a scammer, because that's who is usually calling. When I am expecting a phone call that's a problem because I sometimes don't recognize the number so don't answer. And my voicemail inbox is often just filled with this garbage.
Though the 8B model is almost certainly not capable of near-real-time inference yet, we’re approaching Babel fish territory. The main difference, perhaps, is that this is powered by burning massive amounts of carbon as opposed to a fish brain.
> Though the 8B model is almost certainly not capable of near-real-time inference yet
Google previously showed you could get the full-sized 540B-parameter PaLM-1 model down to "a low-batch-size latency of 29ms per token during generation (with int8 weight quantization)" https://arxiv.org/abs/2211.05102#google . How many tokens per 1000ms do humans speak? I'm guessing fewer than 34. The real question is who wants to pay for it.
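For anyone who wants to check the "fewer than 34" figure, here is a rough back-of-the-envelope sketch. The 29 ms per token comes from the quote above; the speaking rate (150 words per minute) and the tokens-per-word ratio (1.3) are my own assumptions for illustration, not numbers from the paper. It also only counts text tokens, so audio tokens, which may arrive at a different rate, are ignored.

```python
# Back-of-the-envelope comparison: model generation rate vs. human speech rate.

LATENCY_MS_PER_TOKEN = 29.0  # quoted above for PaLM 540B with int8 weights


def tokens_per_second(latency_ms_per_token: float) -> float:
    """Tokens generated per second at a fixed per-token latency."""
    return 1000.0 / latency_ms_per_token


def speech_tokens_per_second(words_per_minute: float, tokens_per_word: float) -> float:
    """Approximate text tokens per second produced by a human speaker."""
    return words_per_minute / 60.0 * tokens_per_word


model_rate = tokens_per_second(LATENCY_MS_PER_TOKEN)

# ASSUMPTIONS (not from the thread or the paper): ~150 words per minute of
# conversational speech and ~1.3 text tokens per word.
human_rate = speech_tokens_per_second(words_per_minute=150, tokens_per_word=1.3)

print(f"model: {model_rate:.1f} tokens/s")
print(f"human: {human_rate:.2f} tokens/s")
print(f"headroom: {model_rate / human_rate:.1f}x")
```

Under those assumptions the model has roughly an order of magnitude of headroom over speech, but the conclusion is only as good as the two guessed inputs.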
I can't wait till everyone is using this and we have absolutely zero idea whether or not it's actually translating things correctly or using its own interpretations of things, going to be...awesomeeeeeeeee!
SOTA bi/multilingual LLMs with good enough representation of the languages (takes much less data than you'd think) are human-level translators. Hallucinations on tasks like summaries, translations, etc. are near non-existent.
Thank you for reminding me that it's going to be, awesommeeee.
Curious, do you speak more than one language?
Edit: I just had a look at your comment history. Do you realize you're, like, incredibly pro-LLM? Do you just scour HN looking for LLM articles and comment on them in a positive way? Not having a poke; it's just interesting how keen you are.
Over here people speak multiple languages. I doubt we'll run out of people that speak multiple languages just because there's a language model that can do great translations.
Your comment history is fairly LLM-skeptical. I'm not sure what that has to do with anything.
The only difference in this instance is that I've actually tested GPT-4 on translations while you haven't.
If you're going to rag on a product's capabilities on x, you'd think the least you could do is use it for x first.
It's obvious you haven't lol. Your comment reads like someone who hasn't, and you never bothered saying you had but just didn't agree on the issue of quality. Even now, your defense isn't "but I have", it's "how do you know I haven't?", a telltale sign of someone who actually hasn't bothered.
Because virtually everyone tests these things with two languages they're familiar with; otherwise you couldn't really verify whether it was correct. For languages you're not familiar with, you don't have the "mental mode" to talk using a translator. That is, there is more to this than just "talking": there are cultural norms, local dialects, slang, etc., which are to be respected when learning and speaking languages with native speakers. When a person who speaks English and Italian tests these things, they know what they're in for and compensate a bit.
Google Translate screws up for me really, really hard sometimes when I'm speaking Korean, but I'm already a pretty strong speaker (native), so I know how to work with the screw-ups... and laugh about the really bad ones. I'm not going to go into a meeting and blast off with an auto-translator without understanding what I'm saying or having someone make sure I'm saying the right thing by talking with them first.
I personally wouldn't feel comfortable using something like this for anything of real significance; a really good translator can ensure the message gets delivered.
Direct link to demo video showing speech-to-speech translation: https://google-research.github.io/seanet/audiopalm/examples/... (see website for more examples)