I've been in many situations where I wanted translations, and I can't think of one where I'd actually want to rely on either glasses or the airpods working like they do in the demos.
The crux of it for me:
- if it's not a person, it will be out of sync: you'll be stopping it every 10 sec to get the translation. One could just as well use a phone, it would be the same, and there's a strong chance the media is already playing from there, so having the translation embedded would be an option.
- with a person, the other person needs to understand when your translation is going on, and when it's over, so they know when to expect an answer or know they can go on. Having a phone in plain sight is actually great for that.
- the other person has no way to check if your translation is completely out of whack. Most of the time they have some vague understanding, even if they can't really speak. Having the translation in the glasses removes any possible control.
There are a ton of smaller points, but all in all the barrier for a translation device to become magic and just work plugged in your ear or glasses is so high I don't expect anything beating a smartphone within my lifetime.
Some of your points are already addressed by current implementations. Airpods live translate uses your phone to display what you say to the target person, and the target person's speech is played to your airpods. I think the main issue is that there is a massive delay and Apple's translation models are inferior to ChatGPT. The other thing is the airpods don't really add much. It works the same as if you had the translation app open and both people were talking to it.
Aircaps demos show it to be pretty fast and almost real time. Meta's live captioning works really fast and is supposed to be able to pick out who is talking in a noisy environment by having you look at the person.
I think most of your issues are just a matter of the models improving themselves and running faster. I've found translations tend to not be out of whack, but this is something that can't really be solved except by having better translation models. In the case of Airpods live translate the app will show both people's text.
It's understating the lag. Faster will always be better, but even "real time" still requires the other person to complete their sentence before getting a translation (there is the edge case of the other language having similar grammatical structure and word order, but IMHO that's rare), and you catch up from there. That's enough lag to warrant putting the whole translation process literally on the table.
I see the real improvements in the models. For IRL translation, I just think phones are very good at this and improving from there will be exponentially difficult.
IMHO it's the same for "bots" intervening (commenting/reacting on exchanges etc.) in meetings. Interfacing multiple humans in the same scene is always a delicate problem.
I have the G1 glasses and unfortunately the microphones are terrible, so the live translation feature barely works. Even if you sit in a quiet room and try to make conditions perfect, the accuracy of transcription is very low. If you try to use it out on the street it rarely gets even a single word correct.
This is the sad reality of most of these AI products: they just put poor feature implementations on the hardware. It seems like if they just picked one of these features and did it well, the glasses would be useful.
Meta has a model just for isolating speech in noisy environments (the “live captioning feature”) and it seems that’s also the main feature of the Aircaps glasses. Translation is a relatively solved problem. The issue is isolating the conversation.
I’ve found Meta is pretty good about not overdelivering on promised features, and as a result, even though they probably have the best hardware and software stack of any glasses, the stuff you can do with the Rayban displays is extremely limited.
Is it even possible to translate in real time? In many languages and sentences, the meaning and translation need to change completely because of one additional word at the very end. Any accurate translation would need to either wait for the end of a sentence or correct itself after the fact.
Live translation is a well-solved problem by this point: the translation will update as it goes, so while you may have a mistranslation visible during the sentence, it will correct when the last word is spoken. The user does need to be aware of this, but in my experience it works well.
Bear in mind that simultaneous interpretation by humans (e.g. with a headset at a meeting of an international organisation) has been a thing for decades.