by pacificleo11
3006 days ago
The importance of data for machine learning algorithms can't be stressed enough. I remember talking to a friend on the Google Translate team. They had a good algo, but they were struggling to get quality translation data to train their service. The problem was more severe for languages that were not very popular: the translation set was next to nothing for, say, Turkish, Hindi, or Latvian. They finally solved this by using meeting notes from the UN Assembly, which had been transcribed by the best of translators. That access to meeting transcriptions was an (unfair?) advantage Google had over other tools. Was it wrong? I don't think so. Should those meeting notes have been public? Yes.
|
I tried this sentence: "Hei äiti, puhun suomea". The expected translation would be "Hi mom, I speak Finnish".
Instead, Google's result was: "July's mother, I speak English".
Obviously the engine had been trained on unvetted data sets where the word "English" occurred in translations in a position where the original had the word "Finnish", and no context was provided to avoid this kind of mistake.
The word "July" came about because "hei" is also used as an abbreviation for "heinäkuu" (July). It was sobering that a supposedly world-class AI couldn't distinguish between these two usages. Machine learning still needs a lot of old-fashioned, hand-tuned, human-made heuristics.