by Radim 4592 days ago
Great point. Often overlooked by people who only know what I call "drive-by machine learning" (finished an online ML course or something).

There's a multitude of problems with real-world texts that a robust guesser must deal with gracefully: short texts; texts in none of the languages the "guesser" was trained for (is it able to return "none of the above", or does it return a random one?); texts in multiple languages (incl. common noun phrases inserted into text in another language); texts with parts repeated multiple times (web pages and blogs in particular are a bitch!), which skews char/word distributions and messes up statistical models; etc.
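To make the repeated-parts and short-text problems concrete, here is a minimal Python sketch. It is not anyone's production pipeline: the helper names (`dedupe_lines`, `trigram_counts`, `enough_evidence`) and the `min_letters=20` cutoff are made up for illustration, and real pages need far more careful boilerplate removal than exact line matching.

```python
from collections import Counter


def dedupe_lines(text):
    """Keep only the first occurrence of each whitespace-normalized line."""
    seen, kept = set(), []
    for line in text.splitlines():
        key = " ".join(line.lower().split())
        if key and key not in seen:
            seen.add(key)
            kept.append(line.strip())
    return "\n".join(kept)


def trigram_counts(text):
    """Count character trigrams over lowercased, whitespace-normalized text."""
    flat = " ".join(text.lower().split())
    return Counter(flat[i:i + 3] for i in range(len(flat) - 2))


def enough_evidence(text, min_letters=20):
    """Hypothetical guard: abstain when the input has too few letters."""
    return sum(ch.isalpha() for ch in text) >= min_letters


if __name__ == "__main__":
    # A navigation line repeated 50 times, followed by one real sentence.
    page = ("Home | About | Contact\n" * 50
            + "Der schnelle braune Fuchs springt über den faulen Hund.\n")
    print(trigram_counts(page)["hom"])                 # 50
    print(trigram_counts(dedupe_lines(page))["hom"])   # 1
```

Without deduplication, the menu's trigrams swamp those of the one sentence that actually carries the language signal. A caller would check `enough_evidence` before guessing at all and return "unknown" otherwise.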

It's the same thing as with spelling correction, really. "But Norvig did it in 1.5 lines of Python!" See "A Spellchecker Used To Be A Major Feat of Software Engineering" at https://news.ycombinator.com/item?id=3466927 Spoiler: it still is, except for "drive-by ML apps".

  Often overlooked by people who only know what I call "drive-by machine learning" (finished an online ML course or something).

A bit sour, are we? ;)

The point is that it is an NLP task where it is relatively easy to get good results on general text (see Cavnar and Trenkle). So, it is a fun and satisfying exercise.
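For readers who have not seen it, the Cavnar and Trenkle approach ranks character n-grams by frequency and compares rank lists with an "out-of-place" distance. A rough sketch follows, under simplifying assumptions: n-grams of length 1 to 3 rather than the paper's 1 to 5, padding around the whole string rather than per token, and a hypothetical `max_distance` parameter added here as a crude "none of the above" option. Function names are illustrative only.

```python
from collections import Counter


def ngram_profile(text, n_max=3, top_k=300):
    """Map each of the top_k most frequent character n-grams to its rank."""
    padded = " " + " ".join(text.lower().split()) + " "
    counts = Counter(
        padded[i:i + n]
        for n in range(1, n_max + 1)
        for i in range(len(padded) - n + 1)
    )
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts, key=lambda g: (-counts[g], g))[:top_k]
    return {gram: rank for rank, gram in enumerate(ranked)}


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from the language profile
    get a fixed maximum penalty."""
    max_penalty = len(lang_profile)
    return sum(
        abs(rank - lang_profile[gram]) if gram in lang_profile else max_penalty
        for gram, rank in doc_profile.items()
    )


def guess_language(text, lang_profiles, max_distance=None):
    """Return the closest language, or None if even the best match is
    farther than max_distance (an assumed, untuned cutoff)."""
    doc = ngram_profile(text)
    scores = {lang: out_of_place(doc, prof)
              for lang, prof in lang_profiles.items()}
    best = min(scores, key=scores.get)
    if max_distance is not None and scores[best] > max_distance:
        return None
    return best
```

Usage would be along the lines of `profiles = {"en": ngram_profile(english_corpus), "de": ngram_profile(german_corpus)}` followed by `guess_language(some_text, profiles)`. The sketch says nothing about choosing a sensible `max_distance`, and that is exactly the kind of detail that separates the exercise from a robust system.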

Saying there is difficult noisy data is pointing out the obvious ;).

If it's obvious to you, then you're not the target audience of my disclaimer :)

But HN responses to posts like these overwhelmingly suggest it's far from obvious.

  So, it is a fun and satisfying exercise.
I agree. Perhaps you can help evangelize the world of difference between "fun exercise" and a production-ready system (the OP is a paid service).

I used to be a bit upset when someone claimed to have implemented a state-of-the-art POS tagger, when they had just taken the dictionary and rules produced by Eric Brill's learner verbatim and applied those. Or worse, they took the first ten rules ;).

Nowadays I just prefer to let evolution do its work. The best or the one with the best marketing wins :).

Liberating approach, Daniel!

I'm still in the naive do-it-well phase, but seeing the downvotes, it may be time to join the hipsters. Or at least shut up ;)