Show HN: Language detection as a service


	Show HN: Language detection as a service (getlang.io)
	38 points by mgaudin 4592 days ago

24 comments

davidjgraph 4592 days ago

I'll ask plainly what others are hinting at: is this actually a service you built yourself, or are you a proxy for something like the Google Translate API[1]?

If you built it yourself, it's critical that you explain the hows and whys of your forecast availability and scalability numbers for your chosen architecture, given who you are competing with.

[1] https://developers.google.com/translate/v2/using_rest#detect...


beering 4592 days ago

Alternatively, people can just download langid.py[1] and do language detection locally. This is not a particularly hard problem - I think it's doable by undergrad ML or NLP classes.

The tricky parts are usually political: are users going to be angry if you confuse Indonesian with Malaysian, and so on?

[1] https://github.com/saffsd/langid.py


microtonal 4592 days ago

I think it's doable by undergrad ML or NLP classes.

In fact, we had a course for high school students where they learnt how a language guesser works and where they had to change a language guesser. A simplistic method that already works very well is:

* Create an n-gram fingerprint for each language by making a list of character uni-, bi-, and trigrams ordered by their frequency in a text. Retain the (say) 300 most frequent n-grams.

* To categorize a text, create a fingerprint for that text. Then compute, for each language, the sum of the n-gram rank differences between the text's fingerprint and the language's fingerprint. If an n-gram does not occur in the language's fingerprint, the difference is the fingerprint size. Finally, pick the language with the lowest sum.
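The two steps above can be sketched in a few lines of Python. This is only a minimal illustration of the rank-order idea, not any particular library's implementation: the function names, the tie-breaking rule, and the toy training sentences are assumptions made for the example, and real profiles would be built from far more text.

```python
from collections import Counter

def fingerprint(text, max_n=3, size=300):
    """Rank character 1- to max_n-grams by frequency; keep the top `size`."""
    text = " ".join(text.lower().split())  # normalize case and whitespace
    counts = Counter(
        text[i:i + n]
        for n in range(1, max_n + 1)
        for i in range(len(text) - n + 1)
    )
    # Descending count; ties broken alphabetically so ranks are deterministic.
    ranked = sorted(counts, key=lambda g: (-counts[g], g))[:size]
    return {gram: rank for rank, gram in enumerate(ranked)}

def distance(doc_fp, lang_fp, size=300):
    """Sum of rank differences; a missing n-gram costs the fingerprint size."""
    return sum(
        abs(rank - lang_fp[gram]) if gram in lang_fp else size
        for gram, rank in doc_fp.items()
    )

def guess(text, profiles, size=300):
    """Return the language whose profile is closest to the text's fingerprint."""
    doc_fp = fingerprint(text, size=size)
    return min(profiles, key=lambda lang: distance(doc_fp, profiles[lang], size))

# Toy training texts (hypothetical, far too small for real use).
samples = {
    "en": "the quick brown fox jumps over the lazy dog and the cat is on the mat",
    "nl": "de snelle bruine vos springt over de luie hond en de kat zit op de mat",
}
profiles = {lang: fingerprint(text) for lang, text in samples.items()}
print(guess("the dog and the fox", profiles))
```

With profiles this small the result on unseen text is not trustworthy; the sketch is only meant to show the mechanics.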

Of course, you can do fancier things, such as training an SVM or logistic regression classifier with n-grams and words as features, etc.

An interesting variation is to be able to distinguish different languages in a text. E.g. a Dutch text with English quotes.
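One naive way to approach the mixed-language case is to classify each sentence separately and merge neighbouring sentences that get the same label. The sketch below assumes you already have some `classify(text)` function that returns a language code; the sentence splitter is a crude regex and `language_spans` is a hypothetical name. Short sentences are exactly where guessers are weakest, so treat this as a starting point only.

```python
import re
from itertools import groupby

def language_spans(text, classify):
    """Label each sentence with classify() and merge adjacent same-language runs."""
    # Crude sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    labelled = [(classify(s), s) for s in sentences]
    return [
        (lang, " ".join(sentence for _, sentence in group))
        for lang, group in groupby(labelled, key=lambda pair: pair[0])
    ]

# Stand-in classifier for the demo only: "en" if the word "the" occurs.
def toy_classify(sentence):
    return "en" if "the" in sentence.lower().split() else "nl"

spans = language_spans(
    "Hij zei het zo. The cat sat on the mat. The dog barked. Daarna ging hij weg.",
    toy_classify,
)
for lang, chunk in spans:
    print(lang, "|", chunk)
```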


matiasb 4592 days ago

"An interesting variation is to be able to distinguish different languages in a text. E.g. a Dutch text with English quotes."

Do you know any interesting work related to the language distinction idea on the same text?


microtonal 4592 days ago

I have never looked into that in detail. These may be some interesting leads:

http://mt-archive.info/IJCNLP-2008-Ehara.pdf
http://202.41.85.68/knm-publications/lang_id_jql.pdf


ma2rten 4592 days ago

It's easy to write a language guesser, but it's not easy to write a good one. Even Google Translate is not perfect (see below).


Radim 4592 days ago

Great point. Often overlooked by people who only know what I call "drive-by machine learning" (finished an online ML course or something).

There's a multitude of problems with real-world texts that a robust guesser must deal with gracefully: short texts; texts in none of the languages the "guesser" was trained for (is it able to return "none of the above", or does it return a random one?); texts in multiple languages (incl. common noun phrases inserted into text in another language); texts with parts repeated multiple times (web pages and blogs in particular are a bitch!), which skews char/word distributions and messes up statistical models; etc.
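Two of these problems can at least be mitigated with simple guards around whatever classifier is used. The sketch below is an assumption-laden illustration: it supposes the classifier exposes probability-like scores per language (higher is better), and the thresholds `min_letters` and `min_margin` are made-up values that would need tuning. A small margin only signals uncertainty; it does not reliably detect a language the model was never trained on.

```python
def dedupe_lines(text):
    """Drop repeated lines (menus, footers, quoted blocks) while keeping order."""
    seen = set()
    kept = []
    for line in text.splitlines():
        key = " ".join(line.lower().split())  # compare case/whitespace-insensitively
        if key and key not in seen:
            seen.add(key)
            kept.append(line.strip())
    return "\n".join(kept)

def pick_or_abstain(scores, n_letters, min_letters=20, min_margin=0.2):
    """Return the best-scoring language, or None for "none of the above".

    `scores` maps language code -> probability-like score (higher is better).
    Abstains when the text is too short or the top two scores are too close.
    """
    if n_letters < min_letters or not scores:
        return None
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    best_lang, best = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0.0
    if best - runner_up < min_margin:
        return None
    return best_lang
```

Running `dedupe_lines` before scoring keeps boilerplate that appears many times on a page from dominating the character statistics.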

It's the same thing as with spelling correction, really. "But Norvig did it in 1.5 lines of Python!" See "A Spellchecker Used To Be A Major Feat of Software Engineering" at https://news.ycombinator.com/item?id=3466927 Spoiler: it still is, except for "drive-by ML apps".


microtonal 4592 days ago

Often overlooked by people who only know what I call "drive-by machine learning" (finished an online ML course or something).

A bit sour, are we? ;)

The point is that it is an NLP task where it is relatively easy to get good results on general text (see Cavnar and Trenkle). So, it is a fun and satisfying exercise.

Saying there is difficult noisy data is pointing out the obvious ;).


Radim 4592 days ago

If it's obvious to you, then you're not the target audience of my disclaimer :)

But HN responses to posts like these overwhelmingly suggest it's far from obvious.

  So, it is a fun and satisfying exercise.

I agree. Perhaps you can help evangelize the world of difference between "fun exercise" and a production-ready system (the OP is a paid service).


microtonal 4592 days ago

It's easy to write a language guesser, but it's not easy to write a good one.

Obviously, it is highly domain and text length dependent (as I also mentioned in another comment).

But, e.g. Cavnar and Trenkle obtained a 99.8% accuracy on newsgroup articles in 14 languages using the method outlined above.

There are very few NLP tasks where you can achieve such high accuracy with relatively simple and understandable methods. That's why it is a nice subject for an NLP introduction to e.g. high school students.

I have worked in parsing and generation, where it is difficult to obtain satisfying results with many man-years of work on newspaper text, let alone tweets or YouTube comments ;).


chrismorgan 4592 days ago

The design is fine, but the language used on the page itself isn't quite right.

I see three spelling errors in your language list:

- Panjabi should be Punjabi;

- Teligu should be Telugu;

- Ukraininan should be Ukrainian.

There are also a few grammar problems earlier in the document, and style problems (e.g. English doesn't use a space before sentence-ending punctuation marks).


mdemare 4592 days ago

Hmm, it takes 5+ seconds to get a response, and it chokes on the same test phrase as Google, thinking "Ik hou van vette lettertypes." is Norwegian...


ma2rten 4592 days ago

It's probably overloaded because it's on Hacker News, and it's probably based on the same features (character n-grams) as Google Translate. Your text is simply too short for character n-grams to be 100% reliable.


diasks2 4592 days ago

Looks interesting. Why not have an input on the landing page where someone can try it out without even signing up? Then people could give it a spin before they give away their email address. Otherwise, the user just has to trust your 99% figure, so it might help to give some data behind it, even if it is a footnote (on a corpus of x, over x period of time, etc.)

Also, I think it would be clearer if it said "A simple and scalable way to automatically classify text by language" instead of "A simple and scalable way to classify automatically text by language".

Design looks very clean though. Nice work.

EDIT: Also, your social media links at the bottom aren't hooked up yet.


himal 4592 days ago

Hint: You can enter any email address you want. You don't have to validate it (well, at least for now).


captn3m0 4592 days ago

For those who thought (like me) that this was a programming language detection service, you can take a look at github/linguist.


microtonal 4592 days ago

Also, for those who would like to know how you can implement a language guesser (sources + link to paper):

http://www.let.rug.nl/vannoord/TextCat/

Python version:

http://thomas.mangin.com/data/source/ngram.py

It's something that is fun to implement and doesn't take more than a few hours at most.


mdemare 4592 days ago

Why is this better than the Google or Bing translate APIs, which also offer language detection?


redox_ 4592 days ago

You should also check for whole, unambiguous words before falling back to trigrams. "marché" exists only in French, whereas "mar", "arc", ... occur in lots of languages. This should drastically improve your results.
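A rough sketch of that idea in Python: build a table of words that appear in exactly one language's word list, let those words vote, and only fall back to an n-gram model when nothing matches. The tiny lexicons and the function names here are invented for illustration; a real system would derive the word lists from large corpora.

```python
def unique_words(lexicons):
    """Map each word found in exactly one language's lexicon to that language."""
    owner = {}
    ambiguous = set()
    for lang, words in lexicons.items():
        for word in set(words):
            if word in ambiguous:
                continue
            if word in owner and owner[word] != lang:
                # Seen in a second language: no longer unambiguous.
                del owner[word]
                ambiguous.add(word)
            else:
                owner[word] = lang
    return owner

def guess_by_words(text, owner):
    """Vote using unambiguous words; return None when no such word is found."""
    votes = {}
    for token in text.lower().split():
        token = token.strip(".,;:!?\"'()")
        lang = owner.get(token)
        if lang:
            votes[lang] = votes.get(lang, 0) + 1
    return max(votes, key=votes.get) if votes else None

# Hypothetical mini-lexicons, for illustration only.
lexicons = {
    "fr": ["marché", "le", "de", "bonjour"],
    "nl": ["de", "lettertypes", "hou", "van"],
    "en": ["the", "market", "van", "hello"],
}
owner = unique_words(lexicons)
print(guess_by_words("Ik hou van vette lettertypes.", owner))
```

A `None` result is the signal to hand the text to the character n-gram model instead.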


redox_ 4592 days ago

Store only the top N common non-ambiguous words if the RAM consumption matters ;)


microtonal 4592 days ago

Or store the lexicon in a deterministic acyclic finite state automaton. E.g. (shameless plug):

https://github.com/danieldk/dictomaton

Though, having implemented a language guesser myself, it's only an issue with very short texts (a few words). On longer texts models based on character n-grams achieve very high accuracies.


alexott 4592 days ago

And it looks like they are using the following library: http://code.google.com/p/language-detection/ - at least the number & list of languages is very similar :-)


ma2rten 4592 days ago

or just the same training data...


web64 4592 days ago

I've used detectlanguage.com[1] in the past, which seems like a very similar service to getlang.io. With both of them it is hard to know what is behind the scenes...

[1] http://detectlanguage.com/


jhull 4592 days ago

I wonder how this performs on short text posts like tweets. At my last gig where we did social media text analysis we used a few different packages (chromium, guess-language, and our own ngram classifier) and still had pretty low accuracy for tweets.


AznHisoka 4592 days ago

Have you looked at the metadata returned with a tweet? It also includes the language, as well as the location of the tweeter, which gives you some clues.


himal 4592 days ago

You guys might want to handle GET requests for the /try URL (https://getlang.io/try) as well. Currently it's returning "Server Error (500)" for GET requests.


martingordon 4592 days ago

Matthew Kirk spoke about a neural network language predictor at RubyConf a few weeks ago. Here are his slides and code: http://modulus7.com/rubyconf/


efeamadasun 4592 days ago

I don't know why I can't stand this sentence "A simple and scalable way to classify automatically text by language". "Classify" and "automatically" need to switch places.


alexott 4592 days ago

Apache Tika (http://tika.apache.org/) also has a language detector, although it may not be as good as CLD...


razvvan 4592 days ago

If I were to implement this I'd rather use Google's Prediction API. At least with that you get a bit of control over what goes into the training data.


bkamapantula 4592 days ago

It's Telugu not Teligu. By Panjabi, do you mean Punjabi?

As others already mentioned, it would be good to have users try examples before signup.


donutdan4114 4592 days ago

"test it out" comes back as french...


oedj 4592 days ago

Maybe you've fallen into the 1% error rate?


afsina 4592 days ago

Language guessing is rather hard when only a few letters are available, especially if you use statistical methods. I think after 20-something letters you enter the >95% accuracy zone. In a simple library I wrote (https://github.com/ahmetaa/zemberek-nlp/tree/master/lang-id, works for 60 languages but no docs yet), the test results for Turkish and English are:

For 20 letters

TR=95.90 EN=94.96

For 50 letters

TR=99.44 EN=99.53

With 50 letters per document, it identifies about 20000 docs per second on a decent desktop.
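For anyone who wants to produce accuracy-by-length numbers for their own guesser, a measurement harness could look roughly like the sketch below. It is not the code used for the figures above: `classify` stands for any function that maps a string to a language code, and `truncate_letters`, `accuracy_at_length`, and the sample documents are hypothetical names and data for the example.

```python
def truncate_letters(text, n_letters):
    """Cut `text` right after its n-th alphabetic character."""
    seen = 0
    for i, ch in enumerate(text):
        if ch.isalpha():
            seen += 1
            if seen == n_letters:
                return text[:i + 1]
    return text  # fewer than n_letters letters: keep the whole text

def accuracy_at_length(classify, labelled_docs, n_letters):
    """Percentage of (text, language) pairs classified correctly after truncation."""
    hits = sum(
        classify(truncate_letters(text, n_letters)) == lang
        for text, lang in labelled_docs
    )
    return 100.0 * hits / len(labelled_docs)

# Hypothetical labelled documents and a stub classifier, for the demo only.
docs = [("hello world, how are you today", "en"), ("merhaba dünya, nasılsın", "tr")]
always_english = lambda text: "en"
print(accuracy_at_length(always_english, docs, 20))
```

Replacing `always_english` with a real guesser and looping over several values of `n_letters` gives a table like the one above.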


phpnode 4592 days ago

How does this compare in accuracy to Chromium's Compact Language Detector?

https://code.google.com/p/chromium-compact-language-detector...

https://github.com/mzsanford/cld


alexott 4592 days ago

In my experience, CLD works pretty well in most cases. But you need to take care of encoding detection...


dbuxton 4592 days ago

Yes, but you presumably need to get that right in order to encode as UTF-8 and send off to a third-party API...


RBerenguel 4592 days ago

Some day I have to rewrite whatlanguageis.com (currently not working) with all the great ideas I had to improve it...


ssiddharth 4592 days ago

It might be mild OCD but it'd be great if the list of supported languages is ordered in some logical way.


m4tthumphrey 4592 days ago

  curl -XPOST -d 'hello' 'https://getlang.io/get?token=...'
  { "code": "fi", "name": "suomi, suomen kieli", "name_en": "Finnish" }

O_O


ismaelc 4592 days ago

Where's the login page? I need to get my token
