

by saip 2851 days ago

Access to parallel corpora is a limiting factor in general. A good way to train a language translator is to use an open source dataset (several here http://opus.nlpl.eu/) to train a base model, and then fine-tune it with a smaller dataset specific to your domain.

In this case, the author claims pretty good accuracy, almost on par with Google Brain's!

  On my test set of 3,000 sentences, the translator obtained a BLEU score of 0.39. This score is the benchmark scoring system used in machine translation, and the current best I could find in English to French is around 0.42 (set by some smart folks at Google Brain). So, not bad.
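For anyone unfamiliar with the metric, here is a minimal sketch of how a corpus-level BLEU score on the same 0 to 1 scale can be computed. It assumes pre-tokenized sentences, a single reference per hypothesis, n-grams up to length 4, and no smoothing; the name `corpus_bleu` is only illustrative, and the article does not say which implementation or tokenization the author used, so scores from this sketch are not directly comparable to the quoted ones.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU: one reference per hypothesis, no smoothing.

    hypotheses and references are parallel lists of token lists.
    Returns a score between 0.0 and 1.0.
    """
    matches = [0] * max_n   # clipped n-gram matches, per n-gram order
    totals = [0] * max_n    # n-grams in the hypotheses, per n-gram order
    hyp_len = ref_len = 0

    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp, n)
            ref_counts = ngrams(ref, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())

    # Without smoothing, any n-gram order with zero matches gives a score of 0.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0

    # Geometric mean of the modified n-gram precisions.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: penalize output that is shorter than the references.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)


hyp = "the cat sat on the mat".split()
ref = "the cat sat on a mat".split()
print(corpus_bleu([hyp], [ref]))
```

Published comparisons usually rely on a standardized tool such as sacreBLEU, because tokenization and smoothing choices can shift the score noticeably.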


jeffreyrogers 2851 days ago

Wow, missed that part when I read it. Pretty incredible that using open source data you can outperform the state-of-the-art machine translators of a few years ago.


zawerf 2851 days ago

For a historical perspective, check out Stanford's NLP course: https://youtu.be/IxQtK2SjWWM?t=1267

Deep learning only started beating traditional methods in 2016!
