by wooorm 4269 days ago
It’s a very interesting idea. Would it work accurately enough when scaled to 160+ languages?

I don't know; I think I used about 40 languages. The beauty is that zip compression captures rich statistical properties of the languages, so representation-wise it should go a long way. But counting compressed output length discretises the language-to-language distance. For shorter texts this might be a problem, since it could easily result in ties. So, maybe. Perhaps I should try :).
Perhaps you should ;) If you do, I’d be interested to know how it goes!
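
For anyone who wants to try the idea discussed above, here is a minimal sketch of one way it could work. It is not the original implementation. It assumes zlib (DEFLATE) as a stand-in for "zip-compression", and the function names (`compressed_size`, `distances`, `closest`) and the tiny reference corpora are hypothetical placeholders. The distance to a language is taken as the number of extra compressed bytes needed when the text is appended to that language's reference corpus. Because that number is an integer, ties are possible, so `closest` returns every language sharing the minimum.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed data (maximum compression level)."""
    return len(zlib.compress(data, 9))


def distances(text: str, corpora: dict[str, str]) -> dict[str, int]:
    """Extra compressed bytes needed to append `text` to each reference corpus.

    A smaller value means the corpus already contains patterns that help
    compress the text, which is taken here as a sign of the same language.
    """
    scores = {}
    for lang, corpus in corpora.items():
        base = corpus.encode("utf-8")
        joined = base + b" " + text.encode("utf-8")
        scores[lang] = compressed_size(joined) - compressed_size(base)
    return scores


def closest(scores: dict[str, int]) -> list[str]:
    """All languages sharing the smallest distance (more than one means a tie)."""
    best = min(scores.values())
    return sorted(lang for lang, score in scores.items() if score == best)


# Toy reference corpora, purely illustrative. Real ones would be far larger.
corpora = {
    "en": "the quick brown fox jumps over the lazy dog and runs through "
          "the forest while the sun is shining",
    "nl": "de snelle bruine vos springt over de luie hond en rent door "
          "het bos terwijl de zon schijnt",
}

scores = distances("the quick brown fox jumps over the lazy dog", corpora)
print(scores, closest(scores))
```

With very short inputs the integer distances sit close together, which is exactly where ties become likely. Larger reference corpora or a finer-grained measure might help, but that is untested here.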