by Radim
4592 days ago
Great point. It's often overlooked by people who only know what I call "drive-by machine learning" (finished an online ML course or something). There's a multitude of problems with real-world texts that a robust guesser must deal with gracefully: short texts; texts in none of the languages the guesser was trained for (can it return "none of the above", or does it return a random language?); texts in multiple languages (incl. common noun phrases inserted into text in another language); texts with parts repeated multiple times (web pages and blogs in particular are a bitch!), which skews char/word distributions and messes up statistical models; etc. It's the same thing as with spelling correction, really. "But Norvig did it in 1.5 lines of Python!" See "A Spellchecker Used To Be A Major Feat of Software Engineering" at https://news.ycombinator.com/item?id=3466927
Spoiler: it still is, except for "drive-by ML apps".

A bit sour, are we? ;)
The point is that it is an NLP task where it is relatively easy to get good results on general text (see Cavnar and Trenkle). So, it is a fun and satisfying exercise.
Saying that noisy data is difficult is just pointing out the obvious ;).
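
For anyone who wants to try the exercise, here is a minimal sketch in the spirit of Cavnar and Trenkle's n-gram rank-order approach, not a faithful reimplementation of their paper. The function names, `top_k`, `min_chars` and the `max_score` cutoff are all assumptions made up for illustration, and the cutoff would need tuning on real data. It also shows one crude way to answer "none of the above" (return `None`) for short or poorly matching texts, which touches only two of the problems listed in the parent comment.

```python
from collections import Counter


def ngram_profile(text, max_n=3, top_k=300):
    """Return the top_k most frequent character n-grams (n = 1..max_n), ranked."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts, key=lambda g: (-counts[g], g))
    return ranked[:top_k]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from lang_profile get a fixed penalty."""
    rank = {g: r for r, g in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(
        abs(r - rank[g]) if g in rank else max_penalty
        for r, g in enumerate(doc_profile)
    )


def guess_language(text, profiles, min_chars=20, max_score=0.8):
    """Return the best-matching label, or None when the guess is not trustworthy.

    min_chars and max_score are hypothetical thresholds, not values from the paper.
    """
    if len(text.strip()) < min_chars:
        return None  # too short to say anything meaningful
    doc_profile = ngram_profile(text)
    if not doc_profile:
        return None
    best_label, best_score = None, None
    for label, lang_profile in profiles.items():
        if not lang_profile:
            continue
        distance = out_of_place(doc_profile, lang_profile)
        # Rough per-n-gram normalisation so one cutoff works across text lengths.
        score = distance / (len(doc_profile) * len(lang_profile))
        if best_score is None or score < best_score:
            best_label, best_score = label, score
    if best_score is None or best_score > max_score:
        return None  # "none of the above"
    return best_label
```

Build `profiles` as a dict mapping a label to `ngram_profile(training_text)` for each language. Mixed-language texts and repeated boilerplate are not handled at all here, which is rather the point of the comment above.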