by jsnathan 3221 days ago
On a slight tangent, are there any fully-trained ready-to-work state-of-the-art open source distributions of translation systems available?

I've never been able to find one, but maybe I just haven't looked hard enough.


Both Google and Facebook have released pre-trained models in a few languages [0, 1].

[0] https://github.com/facebookresearch/fairseq

[1] https://google.github.io/seq2seq/nmt/

Not that I know of, but the source code is available to train a Transformer model in a single day.

https://github.com/tensorflow/tensor2tensor#walkthrough

Yes, there are a number of models, but the quality of the results also depends heavily on the training data, which I wouldn't know how to find or evaluate. It might also require tweaking the algorithms for different languages, which I wouldn't know how to do. So it's not really 'usable' (for me).

I figured someone would have gone to the trouble of combining models with a maintained collection of datasets to produce an open source alternative to Google Translate by now. I've been wondering about that for years and it never seems to happen. Not saying anyone should feel obligated - I'm just curious why we don't see this, when we see so many other open source software projects that are competitive with their commercial alternatives.

Is it difficult/expensive to acquire these datasets? Is it a lot of effort to actually fine-tune the algorithms to reach passable results?

It seems (without knowing the details myself) that the state of the art in actually usable machine translation tools is always locked up in commercial IP, even though it feels (at least to me) like something that should be a free public service and therefore an ideal candidate for the 'open source' treatment.

I think Mozilla and Safari should be interested in having local translation for better privacy and speed.

I should have mentioned, it's state-of-the-art on open datasets. It's not comparable with DeepL or Google Translate, which have their own proprietary datasets. Also, translation models are very big (gigabytes).