by jeffreyrogers 2815 days ago
This is very cool. One thing I wonder about though is whether small companies will be able to compete with large ones like Google in ML in the future. One reason Google's translator is better is that they have way more data. In the past they digitized tons of books, so they have an excellent dataset that has been translated by professional human translators. This data collection is effectively cross-subsidized by Google's primary business: advertising.

Since most competitors to Google's offerings aren't going to have a hugely profitable core business with which to fund all the data collection and normalization that goes into building a high-quality ML system, the future for poorly capitalized competitors seems bleak to me. This seems to support some of the growing rumblings about enforcing antitrust laws against the large tech companies.

Edit: better, not bigger.

6 comments

DeepL got a lot of good press when it came out last year. Some said it was better than Google.

https://www.deepl.com/en/translator

Wow, thank you for mentioning that. I cannot believe how good the translations are! My native tongue is Dutch and I threw in some (long!) English, French and German texts and honestly, they read like they were written by a native speaker. Hugely impressive.
When someone recommended DeepL to me, I almost didn't take it seriously, expecting mediocre translations that aren't bad but are hardly usable without heavy editing. However, after trying it, I'm very impressed with its results, and in those cases where you have a better translation in mind, the interface offers an easy way to suggest and replace expressions. It's impressive.
Access to parallel corpora is a limiting factor in general. A good way to train a language translator is to use an open source dataset (several here http://opus.nlpl.eu/) to train a base model, and then fine-tune it with a smaller dataset specific to your domain.

In this case, the author claims pretty good accuracy, almost on par with Google Brain's!

  On my test set of 3,000 sentences, the translator obtained a BLEU score of 0.39. This score is the benchmark scoring system used in machine translation, and the current best I could find in English to French is around 0.42 (set by some smart folks at Google Brain). So, not bad.
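For anyone unfamiliar with BLEU, here is a rough sketch of how a corpus-level score like the one quoted above can be computed. This is not the article author's code: it assumes one reference translation per sentence, pre-tokenized input, and no smoothing, and the function names are my own. It returns a value on the same 0 to 1 scale as the quoted 0.39.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Unsmoothed corpus-level BLEU, one reference per hypothesis (an assumption)."""
    matched = [0] * max_n
    total = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            h, r = ngrams(hyp, n), ngrams(ref, n)
            # Clipped counts: an n-gram is credited at most as often
            # as it occurs in the reference.
            matched[n - 1] += sum(min(c, r[g]) for g, c in h.items())
            total[n - 1] += sum(h.values())
    # Without smoothing, any n-gram order with zero matches gives BLEU = 0.
    if hyp_len == 0 or min(matched) == 0:
        return 0.0
    # Geometric mean of the n-gram precisions, computed in log space.
    log_precision = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    # Brevity penalty: penalize output that is shorter than the reference.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)


hyp = ["the cat sat on the mat today".split()]
ref = ["the cat sat on the mat".split()]
print(round(corpus_bleu(hyp, ref), 3))
```

Scores from different tokenizers or smoothing settings are not directly comparable, so a gap like 0.39 versus 0.42 only means much if both numbers come from the same test set and the same scoring setup.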
Wow, missed that part when I read it. Pretty incredible that using open source data you can outperform the state-of-the-art machine translators of a few years ago.
For a historical perspective, check out Stanford's NLP course: https://youtu.be/IxQtK2SjWWM?t=1267

Deep learning only started beating traditional methods in 2016!

The EU helps with this too, accidentally. All official documents are translated into all EU languages by very high-quality translators, and all these documents are public.
The value of large corpora for translation may be diminishing. In particular, Facebook have achieved impressive results using unsupervised ML for translation: https://code.fb.com/ai-research/unsupervised-machine-transla...

The basic idea is to use word vector embeddings to build a source<->target dictionary, then combine this with a language recognition model to iteratively bootstrap a set of source<->target training examples for use with a conventional ML approach.

So the value of a large corpus remains; it's just that this one happens to be generated, as opposed to collected.
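To make the dictionary-building step above a bit more concrete, here is a toy sketch of looking up each source word's nearest target word by cosine similarity. It is not Facebook's code: it assumes the two embedding spaces have already been aligned into one shared space (which is the hard part of the real method), and the 2-D vectors and the `induce_dictionary` name are made up for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def induce_dictionary(src_vecs, tgt_vecs):
    """Map each source word to its nearest target word in the shared space."""
    return {
        s: max(tgt_vecs, key=lambda t: cosine(s_vec, tgt_vecs[t]))
        for s, s_vec in src_vecs.items()
    }


# Hypothetical hand-made vectors, assumed to live in an already-aligned space.
english = {"cat": [0.9, 0.1], "dog": [0.1, 0.9]}
french = {"chat": [0.8, 0.2], "chien": [0.2, 0.8]}

print(induce_dictionary(english, french))
# prints {'cat': 'chat', 'dog': 'chien'}
```

A word-by-word dictionary like this would only be the seed; the bootstrapping described above is what turns it into sentence-level training pairs.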
DeepL is already MUCH better than Google Translate: https://www.deepl.com/translator
Another perspective in a similar vein would be the rise of AutoML. Given its absurdly high computational cost, I'd think only enterprises with massive computational power at their disposal would be able to use it.