When I meet people in VR who are ESL, I can tell based on their accent and mannerisms that they learned English by playing video games with westerners or watched a lot of YouTube.
Do we really want to dilute the uniqueness of language by making everyone sound like they came out of a lab in California?
>Do we really want to dilute the uniqueness of language
I can't speak to whether it's desirable or not, but this has been happening with the advent of radio, movies, and television for over a century. So, are we worse off now, linguistically-speaking, than then? Do we really even notice missing accents if we never grew up with them?
language learning also works fine without emotionality faking, and is depending more on authentic speech recognition (e.g. you want the model to notice if you mispronounce important words, not gloss over it and just continue babble as otherwise this will bite you in the ass in the real world) as well as the system's overall specific ability to generate a personal learning curriculum.
Do we really want to dilute the uniqueness of language by making everyone sound like they came out of a lab in California?