by saip 2804 days ago
Access to parallel corpora is a limiting factor in general. A good way to train a translation model is to train a base model on an open source dataset (several are listed at http://opus.nlpl.eu/) and then fine-tune it on a smaller dataset specific to your domain.

In this case, the author claims pretty good accuracy, almost on par with Google Brain's!

  On my test set of 3,000 sentences, the translator obtained a BLEU score of 0.39. This score is the benchmark scoring system used in machine translation, and the current best I could find in English to French is around 0.42 (set by some smart folks at Google Brain). So, not bad.
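
For anyone wondering what a number like 0.39 measures, here is a minimal sketch of corpus-level BLEU, assuming one reference per sentence, whitespace-tokenized input, and no smoothing. The function names (`ngram_counts`, `corpus_bleu`) are my own for illustration and not from the article; the author's actual scoring script may tokenize or smooth differently, so scores are not directly comparable.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count every n-gram (as a tuple) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU on a 0..1 scale, single reference, no smoothing.

    hypotheses, references: parallel lists of token lists.
    """
    matches = [0] * max_n   # clipped n-gram matches, per order
    totals = [0] * max_n    # n-grams produced by the system, per order
    hyp_len = ref_len = 0

    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts = ngram_counts(hyp, n)
            ref_counts = ngram_counts(ref, n)
            # Counter intersection clips each n-gram to its reference count.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())

    # Without smoothing, any order with zero matches gives a score of 0.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0

    # Geometric mean of the modified n-gram precisions.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: punish output that is shorter than the reference.
    brevity = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity * math.exp(log_precision)


if __name__ == "__main__":
    hyp = ["the cat sat on the mat today".split()]
    ref = ["the cat sat on the mat".split()]
    print(round(corpus_bleu(hyp, ref), 3))
```

A perfect match scores 1.0, and a translation that shares no 4-gram with its reference scores 0.0 under this unsmoothed variant, which is why BLEU is normally reported over a whole test set rather than per sentence.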
1 comments

Wow, missed that part when I read it. Pretty incredible that using open source data you can outperform the state-of-the-art machine translators of a few years ago.
For a historical perspective, check out Stanford's NLP course: https://youtu.be/IxQtK2SjWWM?t=1267

Deep learning only started beating traditional methods in 2016!