Hacker News
by diegoserranoa 2947 days ago
I always see all these articles, services and products offering NLP for English. I wonder how this works with other languages that have a different structure, e.g. Japanese, Arabic, etc. It would also be interesting to see how these algorithms behave when considering cultural aspects: one word or expression may have a different meaning in different places. How would the system handle something like "Your service is the sh*t!"? Is that positive? Negative? There's probably info on this subject all over the internet already haha, very interesting though...
7 comments

There is definitely a feedback loop at play in the NLP community. Since most NLP research is based on English, most applications that build off that research are also tailored to English. Our approach to classifying feedback in other languages is to simply use Google Translate's API and then classify the English translation. However, you are right that many aspects of the original feedback are lost in this approach, and our accuracy is lower than it would be if we had a language-specific model for each language.

That being said, there is a promising new research paper from fast.ai (https://arxiv.org/abs/1801.06146) that speaks to using Wikipedia data to create a language model for a specific language, which can then be trained specifically on the task you are trying to solve. If this is as effective as the authors state, then NLP could see huge improvements for non-English languages where there is already a large set of Wikipedia data.

The main thing I've seen to be different across languages is tokenization/lemmatization.

Something as simple as wanting to split a sentence into words can be difficult, e.g. you may want to be able to split German compound nouns into their component words, and to do that you need a model (or list) of nouns so that you can identify these. Or you may be working with Chinese, which doesn't put spaces between words.
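
A minimal sketch of the noun-list approach to compound splitting, assuming a tiny hand-picked lexicon. The function name and the word list are made up for illustration, and linking elements (such as the "s" in "Arbeitszimmer") are ignored, so treat this as a toy rather than a real splitter:

```python
from typing import List, Optional, Set

# Toy lexicon, lowercased; a real system would need a large noun list or a model.
NOUNS: Set[str] = {"haus", "tür", "dampf", "schiff", "fahrt"}


def split_compound(word: str, nouns: Set[str], min_len: int = 3) -> List[str]:
    """Split a compound into known nouns, trying the longest prefix first.

    Backtracks when a prefix matches but the remainder cannot be split.
    Returns [word] unchanged when no full decomposition is found.
    """

    def helper(rest: str) -> Optional[List[str]]:
        if not rest:
            return []  # everything consumed: successful split
        for end in range(len(rest), min_len - 1, -1):
            head = rest[:end]
            if head in nouns:
                tail = helper(rest[end:])
                if tail is not None:
                    return [head] + tail
        return None  # no known noun starts here

    parts = helper(word.lower())
    return parts if parts else [word]


print(split_compound("Haustür", NOUNS))           # ['haus', 'tür']
print(split_compound("Dampfschifffahrt", NOUNS))  # ['dampf', 'schiff', 'fahrt']
print(split_compound("Computer", NOUNS))          # ['Computer'] (not in the toy lexicon)
```

The last call shows the catch: anything missing from the list passes through unsplit, so coverage of the lexicon decides the quality.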

A bunch of the deep learning work starts from characters for this reason: you get to avoid that messy step. In Chinese, though, characters may not quite be the best representation either; maybe you want to break down characters into their component radicals (I don't actually know this, I don't work on Chinese NLP, and have not run this theory past any Chinese speakers).

But if you're not just throwing everything in a Char-LSTM, you may want to do things like lemmatization so that you can generalize across different forms of a word, or maybe you want to use lemmatization info to inform your tokenization, so that you don't lose the form info.

But, really, one big advantage of Neural Nets is that you don't need to do this, that you can just get a big pile of labels via MTurk/users and train on that without really needing to understand the language you're working on very deeply.

> maybe you want to break down characters into their component radicals (I don't actually know this, I don't work on Chinese NLP, and have not run this theory past any Chinese speakers)

No you don't, if what you want to process is text. You're right, however, that a big problem is the segmentation that must happen before any processing and that cannot be done 100% correctly by software. Thus, errors compound down the chain.
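
To make the segmentation problem concrete, here is a rough sketch of forward maximum matching, one of the simplest dictionary-based segmenters. The function name and the four-entry lexicon are assumptions made up for this example; production segmenters are statistical and far more careful:

```python
from typing import List, Set


def forward_max_match(text: str, lexicon: Set[str], max_word_len: int = 4) -> List[str]:
    """Greedy left-to-right segmentation: always take the longest known word."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Toy lexicon: "research", "graduate student", "life", "origin".
lexicon = {"研究", "研究生", "生命", "起源"}
print(forward_max_match("研究生命起源", lexicon))
# ['研究生', '命', '起源']
```

The intended reading is 研究 / 生命 / 起源 ("research the origin of life"), but the greedy match grabs 研究生 ("graduate student") first, and every later step inherits that mistake.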

I'm not "shitting" on the idea. I gave you an informed opinion as someone with Japanese & Mandarin knowledge working in NLP research.

Did you read more than the title of the paper you linked? Because the Stanford paper states:

"Results and Discussion We consistently observed a decrease in performance (i.e. increased for perplexity) with radicals as compared to baseline, in contrast to a significant increase in performance with part-of-speech tags. [...] Such a robust trend indicates that radicals are likely not actually very useful features in language modeling"

For most tasks, you won't get more information on a word by looking at its character decompositions, in the same way that the individual letters of a lemma won't help you for the task.

There are existing use cases, however. It is useful when building dictionaries for human beings (for search, for example; I just put such a tool online yesterday) and when trying to automatically guess the reading of a character.

Arbitrarily saying "No you don't" isn't indicative of an informed opinion.

I haven't really dug into these papers, though the Stanford paper does say "This conclusion is consistent with results from part-of-speech tagging experiments, where we found that radicals of previous word are not a helpful feature, although the radical of the current word is.", whereas the quote you pulled out has to do with language modeling.

Though I wouldn't consider a single negative result from before the deep learning trend took off necessarily indicative of the value.

The more recent paper, on the other hand, sees a positive boost from their "hierarchical radical embeddings" vs traditional word or character embeddings for 4 classification tasks. Not that this is necessarily meaningful either.

In my mind, the usefulness of this would be not that you would get new information, per se, but that you could generalize some amount of knowledge to rare/out-of-vocabulary words.
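
A toy illustration of that back-off idea, assuming a hand-written character-to-radical table (the table, the function name and the feature strings are all made up for this sketch; whether such features help in practice is exactly what is disputed in this thread):

```python
from typing import Dict, List, Set

# Hand-written toy mapping; a real system would need a full decomposition table.
RADICAL: Dict[str, str] = {
    "河": "氵", "湖": "氵", "海": "氵",  # river, lake, sea: water radical
    "松": "木", "柏": "木", "林": "木",  # pine, cypress, forest: wood radical
}


def char_features(char: str, known_chars: Set[str]) -> List[str]:
    """Emit a character feature, plus a radical feature shared across characters."""
    feats = ["char=" + char if char in known_chars else "char=<UNK>"]
    radical = RADICAL.get(char)
    if radical is not None:
        feats.append("radical=" + radical)
    return feats


seen_in_training = {"河", "海", "松"}
print(char_features("河", seen_in_training))  # ['char=河', 'radical=氵']
print(char_features("湖", seen_in_training))  # ['char=<UNK>', 'radical=氵']
```

The unseen character 湖 collapses to `<UNK>` at the character level but still shares the `radical=氵` feature with 河 and 海, which is the only thing a model could generalize from here.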

Since you work in the field though, do you have any pointers to good papers on Chinese NLP?

I don't have generic good pointers but a few interesting things I read or downloaded:

- https://aclanthology.info/pdf/I/I05/I05-7002.pdf This paper makes use of the radicals to build an ontology, but it does so with a stunning amount of depth (historical context, variants, etc.) that most works overlook. Too bad no data is available.

- http://www.persee.fr/doc/clao_0153-3320_1978_num_4_1_1047 Very interesting read on the formation of Chinese-like characters by the Vietnamese. Some of the techniques described were also used by the Japanese when adopting sinograms.

- didn't read the paper, but the references section lists a number of papers about the segmentation of Mandarin: http://www.anthology.aclweb.org/F/F12/F12-3001.pdf

- didn't read it yet, but it seems to contain accurate information on the Chinese writing system: http://learnlab.org/uploads/mypslc/publications/perfetti-lex...

Anyway, I think getting a fair understanding of the writing system requires learning about 600 characters in either Chinese or Japanese, plus the basics of the chosen language.

Many NLP algorithms, especially ML ones, are actually language agnostic because the input is either tokens or characters. The part that is language dependent is tokenization, normalization, and stemming. That part can be difficult and, if not done well, can screw up the downstream algorithm. As with everything, there are exceptions.
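
A small sketch of that split, assuming a bag-of-tokens feature step as the "language agnostic" part. The function names are made up for illustration and stemming is left out entirely; only `unicodedata.normalize` and `str.casefold` are real standard-library calls:

```python
import unicodedata
from collections import Counter
from typing import Callable, List


def normalize(text: str) -> str:
    """Unicode-normalize (NFKC) and case-fold so equivalent strings compare equal."""
    return unicodedata.normalize("NFKC", text).casefold()


def whitespace_tokenize(text: str) -> List[str]:
    """Language-dependent choice that suits space-delimited languages."""
    return text.split()


def char_tokenize(text: str) -> List[str]:
    """Language-dependent choice for scripts without spaces: one token per character."""
    return [ch for ch in text if not ch.isspace()]


def bag_of_tokens(text: str, tokenize: Callable[[str], List[str]]) -> Counter:
    """Language-agnostic step: it only ever sees a list of tokens."""
    return Counter(tokenize(normalize(text)))


print(bag_of_tokens("Good service, good price", whitespace_tokenize))
print(bag_of_tokens("服务很好", char_tokenize))
```

Note that the naive whitespace tokenizer keeps "service," with its comma attached, which is a small instance of a tokenization slip that the downstream step silently inherits.
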
Of course many services exist for other languages too. I helped research and implement aspect-based sentiment analysis systems for Dutch customer feedback. It is almost the exact same task OP implemented but for Dutch. The project was academic consultancy for a customer relations company and the NLP pipeline is in commercial use. If you're curious you can read the paper here: http://aclweb.org/anthology/W17-5218
Thank you for the link to the paper. We had not come across that one.
Non-English languages have different challenges. Typically there is much less training data, but the language itself may have fewer exceptions to its rules.

> How would the system handle something like "Your service is the sh* t!"

This is pretty easy to handle correctly with sufficient training data. A good demonstration is the deepmoji sentiment predictor: https://deepmoji.mit.edu/

Try:

Your service is sh* t!

Your service is shit!

Your service is the sh* t!

Your service is the shit!

Works pretty much perfectly.

Edit: how am I supposed to escape the * without leaving a space after it!?

While sentiment analysis is part of NLP, another task is understanding the semantics of expressions.

Some basic ones are Man > Woman, Dog > Cat, Monday > Tuesday.

Disclaimer: I currently work for an NLP company.

But you can query our knowledge base of semantic words here, with your stated languages and more:

http://lexicon.gavagai.se/

Not necessarily just Japanese or Arabic; even many European languages are quite tricky...
Forget about other languages, it's sometimes tricky for English as well. Someone said language is the world's oldest API but it's also the most complicated. ;-)

For a couple of our Norwegian and Spanish customers, we hit Google Translate to translate feedback into English and then feed it through our ML engine to classify it. Accuracy obviously is not as good as it should be, but it gives them good insight.
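
The translate-then-classify flow described here can be sketched roughly as follows. The translator and classifier are hypothetical stand-ins passed in as plain functions (this is not the Google Translate API or anyone's actual ML engine), so the example runs offline:

```python
from typing import Callable


def classify_feedback(
    text: str,
    lang: str,
    translate: Callable[[str, str], str],
    classify_english: Callable[[str], str],
) -> str:
    """Route non-English feedback through translation, then an English-only classifier."""
    if lang != "en":
        text = translate(text, lang)
    return classify_english(text)


# Hypothetical stand-ins so the sketch runs without network access.
def fake_translate(text: str, source_lang: str) -> str:
    lookup = {("es", "El servicio es excelente"): "The service is excellent"}
    return lookup[(source_lang, text)]


def fake_classifier(text: str) -> str:
    return "positive" if "excellent" in text.lower() else "unknown"


print(classify_feedback("El servicio es excelente", "es", fake_translate, fake_classifier))
# positive
```

Any nuance the translation step drops never reaches the classifier, which is where the accuracy loss mentioned above comes from.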