Hacker News
by kccqzy 1591 days ago
What's a good way to detect languages in mixed-language passages? What's the state of the art here?

For example, given "'I think, therefore I am' is the first principle of René Descartes's philosophy that was originally published in French as je pense, donc je suis.", is there a library that would tell me the main passage is in English, but contains fragments in French?

2 comments

Worth noting that this is on Lingua-Go's issues list for the 1.1.0 version: https://github.com/pemistahl/lingua-go/issues/9
With an n-gram-based model like this one, you can just feed it short substrings, since it doesn't take the larger context into account anyway. There'll be some problems at the boundaries between languages, because e.g. "as" is a word in both English and French.
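
A rough sketch of the short-substring idea, in Python. This is not Lingua's API: `toy_detect` is a hypothetical stand-in that just counts hits against two tiny hand-picked word lists, and `label_spans` and the `window` parameter are names made up for illustration. In practice you would pass in a real detector's function instead.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical toy word lists, for illustration only. A real n-gram
# detector would replace toy_detect entirely.
EN = {"i", "think", "therefore", "am", "is", "the", "first", "of", "that", "was", "in", "as"}
FR = {"je", "pense", "donc", "suis", "le", "la", "est", "que", "as"}


def toy_detect(chunk: str) -> Optional[str]:
    """Return "en", "fr", or None when the chunk is ambiguous."""
    words = [w.strip(".,'\"").lower() for w in chunk.split()]
    en_hits = sum(w in EN for w in words)
    fr_hits = sum(w in FR for w in words)
    if en_hits == fr_hits:
        return None  # e.g. "as" alone matches both lists
    return "en" if en_hits > fr_hits else "fr"


def label_spans(
    text: str,
    detect: Callable[[str], Optional[str]],
    window: int = 3,
) -> List[Tuple[Optional[str], str]]:
    """Classify fixed-size word windows, then merge adjacent windows
    that received the same label into one span."""
    words = text.split()
    spans: List[Tuple[Optional[str], str]] = []
    for i in range(0, len(words), window):
        chunk = " ".join(words[i:i + window])
        lang = detect(chunk)
        if spans and spans[-1][0] == lang:
            # Same language as the previous window: extend that span.
            spans[-1] = (lang, spans[-1][1] + " " + chunk)
        else:
            spans.append((lang, chunk))
    return spans


if __name__ == "__main__":
    text = "I think therefore I am is the first principle je pense donc je suis"
    for lang, span in label_spans(text, toy_detect, window=3):
        print(lang, "|", span)
```

With `window=3` the windows happen to line up with the language switch. With `window=4` the word "principle" lands in the same window as "je pense donc" and gets labeled French, which is the boundary problem in miniature. Overlapping windows or a smaller window size would reduce that, at the cost of giving the detector less text per call.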