by breuderink 4268 days ago
One method that I have used in the past was über-simple, yet extremely effective. It exploits ZIP compression, based on the insight/assumption that two concatenated texts compress better when they share their language.

I think I found it in this paper [1]. The implementation was like 13 lines of Python code. I wonder how it would compare.
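For illustration, here is a rough sketch of how the idea could look. It is not the original 13 lines, just a guess at the general shape. It assumes zlib's DEFLATE as a stand-in for ZIP compression, and the names `extra_bytes`, `guess_language` and `CORPORA` are made up for this example. The two reference texts are toy placeholders; a real setup would need a much larger sample per language.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the DEFLATE-compressed data (maximum compression level)."""
    return len(zlib.compress(data, 9))


def extra_bytes(reference: str, text: str) -> int:
    """How many more compressed bytes are needed once `text` is appended to `reference`."""
    ref = reference.encode("utf-8")
    return compressed_size(ref + text.encode("utf-8")) - compressed_size(ref)


def guess_language(text: str, corpora: dict) -> str:
    """Pick the language whose reference text makes `text` cheapest to compress."""
    return min(corpora, key=lambda lang: extra_bytes(corpora[lang], text))


# Toy reference corpora (hypothetical, far too small for real use).
CORPORA = {
    "en": "the quick brown fox jumps over the lazy dog and this is a short "
          "example of english text that we use as a reference for the language",
    "nl": "de snelle bruine vos springt over de luie hond en dit is een kort "
          "voorbeeld van nederlandse tekst die we als referentie voor de taal gebruiken",
}

if __name__ == "__main__":
    print(guess_language("this is a short text about a dog", CORPORA))
```

One design note: DEFLATE can only refer back about 32 KiB, so with this particular compressor only the tail of a long reference text can help compress the appended sample.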

[1] http://www.ccs.neu.edu/home/jaa/CSG399.05F/Topics/Papers/Ben...

1 comments

It’s a very interesting idea. Would it be accurate enough when scaled to 160+ languages?
I don't know, I think I used about 40 languages. The beauty is that zip-compression captures rich statistical properties of the languages, so representation-wise it should go a long way. But counting compressed output length discretises the lang-lang distance. For shorter texts this might be a problem, since it could easily result in ties. So, maybe. Perhaps I should try :).
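To make the tie problem concrete, here is a small sketch that returns every language reaching the best score instead of a single winner. Again it assumes zlib as the compressor, and `best_languages` is a hypothetical helper, not something from the paper. Because the score is a whole number of bytes, several languages can land on the same value when the sample is short.

```python
import zlib


def extra_bytes(reference: bytes, text: bytes) -> int:
    """Integer number of extra compressed bytes caused by appending `text`."""
    return len(zlib.compress(reference + text, 9)) - len(zlib.compress(reference, 9))


def best_languages(text: str, corpora: dict):
    """Return the lowest score and all languages that share it (more than one means a tie)."""
    sample = text.encode("utf-8")
    scores = {
        lang: extra_bytes(ref.encode("utf-8"), sample)
        for lang, ref in corpora.items()
    }
    best = min(scores.values())
    return best, sorted(lang for lang, score in scores.items() if score == best)
```

If the returned list has more than one entry, the compressed length alone cannot separate those languages, and some tie-breaker (a longer sample, for instance) would be needed.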
Perhaps you should ;) If you do, I’d be interested to know how it goes!