| The main thing I've seen differ across languages is tokenization/lemmatization. Something as simple as splitting a sentence into words can be difficult. E.g. you may want to split German compound nouns into their component words, and to do that you need a model (or list) of nouns so that you can identify them. Or you're working on Chinese, which doesn't put spaces between words. A bunch of the deep learning work starts from characters for this reason: you get to avoid that messy step. Though in Chinese, characters may not quite be the best representation either; maybe you want to break characters down into their component radicals (I don't actually know this, I don't work on Chinese NLP, and I have not run this theory past any Chinese speakers). But if you're not just throwing everything into a char-LSTM, you may want to do things like lemmatization so that you can generalize across different forms of a word, or maybe you want to use lemmatization info to inform your tokenization, so that you don't lose the form info. But, really, one big advantage of neural nets is that you don't need to do this: you can just get a big pile of labels via MTurk/users and train on that without really needing to understand the language you're working on very deeply. |
No you don't, if what you want to process is text. You're right, however, that a big problem is the segmentation that must happen before any processing, which software cannot do 100% correctly. Thus, errors compound down the chain.
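
To make the segmentation problem concrete, here is a minimal sketch of list-based German compound splitting. Everything in it is an assumption for illustration: `LEXICON` is a toy word list, `split_compound` is a made-up helper rather than any library's API, and linking elements such as the Fugen-s are ignored. It returns every split the list allows, which shows why a word list alone cannot pick the right one.

```python
# Hypothetical toy lexicon; a real system would need a large noun list or a model.
LEXICON = {"stau", "staub", "becken", "ecken", "dampf", "schiff"}


def split_compound(word, lexicon):
    """Return every way to split `word` into a sequence of lexicon entries."""

    def helper(rest):
        # An empty remainder means the whole word was consumed: one valid (empty) tail.
        if not rest:
            return [[]]
        results = []
        # Try every prefix that is a known word, then split what is left.
        for i in range(1, len(rest) + 1):
            head = rest[:i]
            if head in lexicon:
                for tail in helper(rest[i:]):
                    results.append([head] + tail)
        return results

    return helper(word.lower())


if __name__ == "__main__":
    # "Staubecken" is ambiguous: Stau + Becken (reservoir) or Staub + Ecken (dust corners).
    for candidate in split_compound("Staubecken", LEXICON):
        print(candidate)
```

Both readings come back as valid splits, and nothing in the list says which one the writer meant. Whatever the tokenizer picks here is what every later step (lemmatization, tagging, classification) has to live with.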