HN Mirror


by breuderink 4315 days ago

One method that I have used in the past was über-simple, yet extremely effective. It exploits ZIP compression, based on the insight/assumption that two concatenated texts compress better when they share a language.

I think I found it in this paper [1]. The implementation was like 13 lines of Python code. I wonder how it would compare.

[1] http://www.ccs.neu.edu/home/jaa/CSG399.05F/Topics/Papers/Ben...
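A minimal sketch of the idea, not the original 13 lines. It assumes the score is the number of extra compressed bytes a reference text needs once the unknown text is appended, and it uses Python's `zlib` (DEFLATE, the same algorithm ZIP uses) rather than actual ZIP files. The reference sentences and the names `REFERENCE`, `extra_bytes` and `guess_language` are made up for illustration; a real setup would use much larger samples per language.

```python
import zlib

# Toy reference texts, invented for this sketch (an assumption, not real data).
REFERENCE = {
    "english": (
        "the quick brown fox jumps over the lazy dog and this is a simple "
        "example of english text that people write every day"
    ),
    "dutch": (
        "de snelle bruine vos springt over de luie hond en dit is een eenvoudig "
        "voorbeeld van nederlandse tekst die mensen elke dag schrijven"
    ),
}


def compressed_size(text):
    """Number of bytes after DEFLATE compression at the highest level."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def extra_bytes(reference, text):
    """Extra compressed bytes needed when `text` is appended to `reference`."""
    return compressed_size(reference + " " + text) - compressed_size(reference)


def guess_language(text, references=REFERENCE):
    """Pick the language whose reference text absorbs `text` most cheaply."""
    return min(references, key=lambda lang: extra_bytes(references[lang], text))


if __name__ == "__main__":
    print(guess_language("this is a simple example of english text"))
```

Subtracting the size of the reference on its own keeps a long reference from being penalised just for being long; only the cost of the appended text is compared.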

1 comments

wooorm 4315 days ago

It’s a very interesting idea. Would it be accurate enough when scaled to 160+ languages?


breuderink 4315 days ago

I don't know; I think I used about 40 languages. The beauty is that ZIP compression captures rich statistical properties of the languages, so representation-wise it should go a long way. But counting compressed output length discretises the language-to-language distance. For shorter texts this might be a problem, since it could easily result in ties. So, maybe. Perhaps I should try :).
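To make the tie problem concrete, here is a rough sketch under the same kind of assumptions: the score is a whole number of extra compressed bytes (via Python's `zlib`), so several languages can land on exactly the same value. The helper names `extra_bytes` and `tied_best` are hypothetical, as are the reference texts you would pass in.

```python
import zlib


def extra_bytes(reference, text):
    """Extra compressed bytes needed when `text` is appended to `reference`."""
    def size(s):
        return len(zlib.compress(s.encode("utf-8"), 9))
    return size(reference + " " + text) - size(reference)


def tied_best(text, references):
    """Return every language that shares the lowest (integer) score."""
    scores = {lang: extra_bytes(ref, text) for lang, ref in references.items()}
    best = min(scores.values())
    # More than one entry here means the byte count alone cannot decide.
    return sorted(lang for lang, score in scores.items() if score == best)
```

If `tied_best` returns more than one language for a short input, the byte count alone cannot separate them, and some tie-breaker (more input text, or larger reference samples) would be needed.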


wooorm 4315 days ago

Perhaps you should ;) If you do, I’d be interested to know how it goes!
