by ma2rten 4593 days ago
It's easy to write a language guesser, but it's not easy to write a good one. Even Google Translate is not perfect (see below).

Great point. Often overlooked by people who only know what I call "drive-by machine learning" (finished an online ML course or something).

There's a multitude of problems with real-world texts that a robust guesser must deal with gracefully: short texts; texts in none of the languages the "guesser" was trained for (is it able to return "none of the above", or does it return a random one?); texts in multiple languages (incl. common noun phrases inserted into text in another language); texts with parts repeated multiple times (web pages and blogs in particular are a bitch!), which skews char/word distributions and messes up statistical models; etc.
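To make two of these points concrete (repeated parts and texts too short to judge), here is a minimal sketch of a guard that could sit in front of any classifier. Everything in it is an assumption for illustration: the names `dedupe_lines` and `guess_or_none`, the `classify` callback, and the 20-character minimum are made up here and not taken from any particular library or from the service under discussion.

```python
from typing import Callable, Optional


def dedupe_lines(text: str) -> str:
    """Keep only the first occurrence of each non-empty line.

    Repeated navigation or boilerplate lines would otherwise be counted
    many times and skew character/word statistics.
    """
    seen = set()
    kept = []
    for line in text.splitlines():
        key = line.strip().lower()  # compare case-insensitively
        if not key or key in seen:
            continue
        seen.add(key)
        kept.append(line.strip())
    return "\n".join(kept)


def guess_or_none(
    text: str,
    classify: Callable[[str], str],  # hypothetical underlying guesser
    min_chars: int = 20,             # arbitrary threshold, tune on real data
) -> Optional[str]:
    """Return a language code, or None when the text is too short to judge."""
    cleaned = dedupe_lines(text)
    if len(cleaned) < min_chars:
        return None
    return classify(cleaned)
```

This only covers the "too short" flavour of "none of the above". Rejecting texts in an untrained language would need some confidence score from the classifier itself, which this sketch does not attempt.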

It's the same thing as with spelling correction, really. "But Norvig did it in 1.5 lines of Python!" See "A Spellchecker Used To Be A Major Feat of Software Engineering" at https://news.ycombinator.com/item?id=3466927 Spoiler: it still is, except for "drive-by ML apps".

Often overlooked by people who only know what I call "drive-by machine learning" (finished an online ML course or something).

A bit sour, are we? ;)

The point is that it is an NLP task where it is relatively easy to get good results on general text (see Cavnar and Trenkle). So, it is a fun and satisfying exercise.

Saying there is difficult noisy data is pointing out the obvious ;).

If it's obvious to you, then you're not the target audience of my disclaimer :)

But HN responses to posts like these overwhelmingly suggest it's far from obvious.

  So, it is a fun and satisfying exercise.
I agree. Perhaps you can help evangelize the world of difference between "fun exercise" and a production-ready system (the OP is a paid service).

I used to be a bit upset when someone claimed to have implemented a state-of-the-art POS tagger, when they had just taken the dictionary and rules produced by Eric Brill's learner verbatim and applied those. Or worse, they took the first ten rules ;).

Nowadays I just prefer to let evolution do its work. The best or the one with the best marketing wins :).

Liberating approach, Daniel!

I'm still in the naive do-it-well phase, but seeing the downvotes, it may be time to join the hipsters. Or at least shut up ;)

It's easy to write a language guesser, but it's not easy to write a good one.

Obviously, it depends heavily on the domain and on text length (as I also mentioned in another comment).

But, e.g., Cavnar and Trenkle obtained 99.8% accuracy on newsgroup articles in 14 languages using the method outlined above.

There are very few NLP tasks where you can achieve such high accuracy with relatively simple and understandable methods. That's why it is a nice subject for an NLP introduction to e.g. high school students.
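For readers who have not seen it, the rank-order idea usually attributed to Cavnar and Trenkle can be sketched in a few dozen lines. This is a simplified illustration, not their exact procedure: the tokenisation, the single `_` padding character, the tie-breaking rule, and the defaults `max_n=5` and `top_k=300` are assumptions made here, and all function names are hypothetical.

```python
import re
from collections import Counter
from typing import Dict, List, Optional


def ngram_profile(text: str, max_n: int = 5, top_k: int = 300) -> List[str]:
    """Return the top_k character n-grams of text, most frequent first."""
    counts: Counter = Counter()
    # Letters only; digits and punctuation act as token separators.
    for token in re.findall(r"[^\W\d_]+", text.lower()):
        padded = f"_{token}_"  # mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                gram = padded[i:i + n]
                if gram.strip("_"):  # skip grams made only of padding
                    counts[gram] += 1
    # Sort by descending count; break ties alphabetically for determinism.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]


def out_of_place(doc_profile: List[str], lang_profile: List[str]) -> int:
    """Sum of rank differences; n-grams missing from lang_profile get a fixed penalty."""
    rank = {gram: r for r, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(
        abs(r - rank[gram]) if gram in rank else max_penalty
        for r, gram in enumerate(doc_profile)
    )


def guess_language(text: str, profiles: Dict[str, List[str]]) -> Optional[str]:
    """Pick the language whose profile is closest to the text's profile."""
    doc = ngram_profile(text)
    if not doc or not profiles:
        return None
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))
```

Profiles would be built once per language with `ngram_profile(sample_text)` and passed in as a dict such as `{"en": ..., "nl": ...}`. Note that `guess_language` always returns one of the trained languages for any non-empty text, so it has no "none of the above" answer.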

I have worked in parsing and generation, where it is difficult to obtain satisfying results with many man-years of work on newspaper text, let alone tweets or YouTube comments ;).