Hacker News
by mabcat 2108 days ago
> any resource recommendations or tips for writing a translation engine

All the successful translation websites/apps you're familiar with use machine learning. ML stomped all over rule-based NLP approaches because it gives a rough translation between so many more languages for so much less work.

On the question of where the data comes from, you might be a bit closer than you think. That dictionary you're transcribing has some sentence pairs, and like yorwba said, sentence pairs are food for ML language models. Extracting all the sentence pairs into a dataset might raise some interest from ML people.
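For what it's worth, here is a rough sketch of what "extracting the sentence pairs into a dataset" could look like. Everything in it is an assumption for illustration: the `entries` structure, the placeholder strings, and the helper names `extract_pairs` and `to_tsv` are hypothetical and not taken from the actual dictionary. It just collects (source, English) example pairs, drops duplicates and empty ones, and writes a tab-separated file, which is a common shape for parallel text.

```python
import csv
import io

# Hypothetical transcribed entries: each headword may carry example sentences
# paired with an English translation. Placeholder strings, not real data.
entries = [
    {"headword": "word_a",
     "examples": [("source sentence 1", "English sentence 1")]},
    {"headword": "word_b", "examples": []},
    {"headword": "word_c",
     "examples": [("source sentence 2", "English sentence 2"),
                  ("source sentence 1", "English sentence 1")]},
]


def extract_pairs(entries):
    """Collect unique (source, english) pairs, keeping first-seen order."""
    seen = set()
    pairs = []
    for entry in entries:
        for source, english in entry.get("examples", []):
            pair = (source.strip(), english.strip())
            # Skip pairs with an empty side and pairs already collected.
            if all(pair) and pair not in seen:
                seen.add(pair)
                pairs.append(pair)
    return pairs


def to_tsv(pairs):
    """Render the pairs as tab-separated text with a header row."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(["source", "english"])
    writer.writerows(pairs)
    return buf.getvalue()


pairs = extract_pairs(entries)
tsv_text = to_tsv(pairs)
print(tsv_text)
```

Keeping one pair per row makes it easy to share the file or count how many pairs there are before asking anyone whether it is enough to train on.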

2 comments

I think ML approaches are not well suited here, or at least there are still huge problems translating to or from non-Germanic/Latin languages that have yet to be addressed. (My background: white Australian, not a linguist, know a little Vietnamese, know a very tiny bit about Indigenous Australian languages based on chatting with friends who studied that area)

I continually see consumer-facing ML approaches (FB, Google) give terrible Vietnamese translations because they assume all of the context needed for a translation is available in the text. In general this is not the case. In Vietnamese this is hugely obvious because the pronoun system is largely based on 3rd person relationships ("sister walks down the street", "boyfriend loves girlfriend"), which is impossible to map to/from English 1st and 2nd person ("you walk down the street", "I love you") without basically a full conscious intelligence. Even FB, which is in the unique position of actually having a lot of the requisite relationship data between people available to it, does a terrible job at this.

My (tiny) understanding of the incredibly rich kinship systems in indigenous Australian cultures suggests that this would be a huge issue there as well, assuming these complexities are also present in their languages. (...OP? :) )

Ahh this is a super useful comment. Thanks for the insight. I will think about how I can get more language pairs too.

I did just finish my first ML CV project recently, so I'm kind of in the headspace.