by microtonal 4595 days ago
I think it's doable by undergrad ML or NLP classes.

In fact, we had a course for high school students where they learnt how a language guesser works and had to modify one themselves. A simplistic method that already works very well is:

* Create an n-gram fingerprint for each language by making a list of character uni-, bi-, and trigrams ordered by their frequency in a text. Retain the (say) 300 most frequent n-grams.

* To categorize a text, create a fingerprint for that text. Then compute, for each language, the sum of the n-gram rank differences between the text fingerprint and the language fingerprint. If an n-gram does not occur in the language fingerprint, the difference is the fingerprint size. Finally, pick the language with the lowest sum.
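
The two steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Cavnar and Trenkle's reference implementation: the function names, the whitespace/case normalization, the alphabetical tie-breaking, and the toy training sentences are all made up for the example. A real guesser would be trained on far more text per language.

```python
from collections import Counter

def ngrams(text, max_n=3):
    """Yield all character n-grams of length 1..max_n (uni-, bi-, trigrams)."""
    text = " ".join(text.lower().split())  # assumption: fold case, collapse whitespace
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            yield text[i:i + n]

def fingerprint(text, size=300):
    """Map each of the `size` most frequent n-grams to its rank (0 = most frequent)."""
    counts = Counter(ngrams(text))
    # Sort by descending frequency; break ties alphabetically so ranks are deterministic.
    ranked = sorted(counts, key=lambda g: (-counts[g], g))[:size]
    return {gram: rank for rank, gram in enumerate(ranked)}

def distance(text_fp, lang_fp, size=300):
    """Sum of rank differences; an n-gram missing from lang_fp costs `size`."""
    return sum(
        abs(rank - lang_fp[gram]) if gram in lang_fp else size
        for gram, rank in text_fp.items()
    )

def guess(text, lang_fps, size=300):
    """Return the language whose fingerprint has the lowest distance to the text."""
    text_fp = fingerprint(text, size)
    return min(lang_fps, key=lambda lang: distance(text_fp, lang_fps[lang], size))

# Toy training data (hypothetical; far too small for real use).
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and the cat sat on the mat",
    "nl": "de snelle bruine vos springt over de luie hond en de kat zat op de mat",
}
LANG_FPS = {lang: fingerprint(sample) for lang, sample in SAMPLES.items()}

if __name__ == "__main__":
    print(guess("the dog and the cat are in the house", LANG_FPS))
```

Keeping the fingerprint as a gram-to-rank dictionary makes the rank lookup constant time, so classifying a text costs roughly one pass over its own fingerprint per language.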

Of course, you can do fancier things, such as training an SVM or logistic regression classifier with n-grams and words as features, etc.

An interesting variation is to be able to distinguish different languages in a text. E.g. a Dutch text with English quotes.


"An interesting variation is to be able to distinguish different languages in a text. E.g. a Dutch text with English quotes."

Do you know any interesting work related to the language distinction idea on the same text?

I have never looked into that in detail. These may be some interesting leads:

http://mt-archive.info/IJCNLP-2008-Ehara.pdf
http://202.41.85.68/knm-publications/lang_id_jql.pdf

It's easy to write a language guesser, but it's not easy to write a good one. Even Google Translate is not perfect (see below).
Great point. Often overlooked by people who only know what I call "drive-by machine learning" (finished an online ML course or something).

There's a multitude of problems with real-world texts that a robust guesser must deal with gracefully: short texts; texts in none of the languages the "guesser" was trained for (is it able to return "none of the above", or does it return a random one then?); texts in multiple languages (incl. common noun phrases inserted into text in another language); texts with parts repeated multiple times (web pages and blogs in particular are a bitch!), which skews char/word distributions and messes up statistical models; etc.
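
Two of these failure modes can at least be mitigated cheaply. The sketch below is only an illustration for a rank-difference guesser: `dedupe_lines` and `pick_language` are hypothetical helpers, and the thresholds (`max_ratio`, `min_n_grams`) are made-up values that would need tuning on real data. It drops repeated lines before fingerprinting and returns `None` ("none of the above") when the text is too short or even the best match is close to the worst possible score.

```python
def dedupe_lines(text):
    """Drop repeated lines (e.g. page boilerplate), keeping first occurrences in order."""
    seen = set()
    kept = []
    for line in text.splitlines():
        key = line.strip().lower()
        if key and key not in seen:
            seen.add(key)
            kept.append(line.strip())
    return "\n".join(kept)

def pick_language(distances, n_grams, fingerprint_size=300,
                  max_ratio=0.8, min_n_grams=20):
    """Return the best-scoring language, or None for "none of the above".

    distances: {language: sum of rank differences for this text}
    n_grams:   number of n-grams in the text's fingerprint
    """
    if not distances or n_grams < min_n_grams:
        return None  # too little evidence to guess at all
    best = min(distances, key=distances.get)
    # If every n-gram were missing, each would cost `fingerprint_size`.
    worst_possible = n_grams * fingerprint_size
    if distances[best] / worst_possible > max_ratio:
        return None  # even the best match is nearly as bad as no match
    return best
```

Neither helper addresses mixed-language texts; that needs some form of segmentation before guessing.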

It's the same thing as with spelling correction, really. "But Norvig did it in 1.5 lines of Python!" See "A Spellchecker Used To Be A Major Feat of Software Engineering" at https://news.ycombinator.com/item?id=3466927 Spoiler: it still is, except for "drive-by ML apps".

Often overlooked by people who only know what I call "drive-by machine learning" (finished an online ML course or something).

A bit sour, are we? ;)

The point is that it is an NLP task where it is relatively easy to get good results on general text (see Cavnar and Trenkle). So, it is a fun and satisfying exercise.

Saying there is difficult noisy data is pointing out the obvious ;).

If it's obvious to you, then you're not the target audience of my disclaimer :)

But HN responses to posts like these overwhelmingly suggest it's far from obvious.

  So, it is a fun and satisfying exercise.
I agree. Perhaps you can help evangelize the world of difference between "fun exercise" and a production-ready system (the OP is a paid service).
I agree. Perhaps you can help evangelize the world of difference between "fun exercise" and a production-ready system (the OP is a paid service).

I used to be a bit upset when someone claimed to have implemented a state-of-the-art POS tagger when they had just taken the dictionary and rules produced by Eric Brill's learner verbatim and applied those. Or worse, taken only the first ten rules ;).

Nowadays I just prefer to let evolution do its work. The best or the one with the best marketing wins :).

Liberating approach, Daniel!

I'm still in the naive do-it-well phase, but seeing the downvotes, it may be time to join the hipsters. Or at least shut up ;)

It's easy to write a language guesser, but it's not easy to write a good one.

Obviously, it depends heavily on the domain and the text length (as I also mentioned in another comment).

But, e.g., Cavnar and Trenkle obtained 99.8% accuracy on newsgroup articles in 14 languages using the method outlined above.

There are very few NLP tasks where you can achieve such high accuracy with relatively simple and understandable methods. That's why it is a nice subject for an NLP introduction to e.g. high school students.

I have worked in parsing and generation, where it is difficult to obtain satisfying results even with many man-years of work on newspaper text, let alone tweets or YouTube comments ;).