by rsmith49 2940 days ago

Unfortunately, this is very common in NLP applications. Our first line of defense is a spellcheck step applied while preprocessing all of our data. Beyond that, some of the algorithms we employ only look for the presence of words in the feedback and do not depend on grammatical correctness. So if we get something like "food good, love love love", we can still recognize that the feedback refers positively to food quality, and our ensemble prediction will reflect this.
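To make that concrete, here is a rough sketch of the two ideas: a spellcheck pass during preprocessing, followed by features that only record which words are present. This is not our actual pipeline. The vocabulary, the keyword sets, the tag names, and the use of Python's `difflib` for spelling correction are all assumptions made for illustration.

```python
import difflib
import re

# Hypothetical vocabulary and keyword sets; a real system would learn these.
VOCAB = ["food", "good", "love", "service", "slow", "great", "bad"]
TOPIC_WORDS = {"food_quality": {"food"}, "service": {"service"}}
POSITIVE = {"good", "love", "great"}


def spell_correct(token):
    """Map a token to its closest vocabulary word, or keep it unchanged."""
    matches = difflib.get_close_matches(token, VOCAB, n=1, cutoff=0.75)
    return matches[0] if matches else token


def preprocess(text):
    """Lowercase, split into alphabetic tokens, and spell-correct each one."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [spell_correct(t) for t in tokens]


def presence_features(text):
    """Record which topics and sentiment words appear, ignoring word order."""
    words = set(preprocess(text))
    topics = sorted(t for t, kws in TOPIC_WORDS.items() if words & kws)
    return {"topics": topics, "positive": bool(words & POSITIVE)}


print(presence_features("food good, love love love"))
print(presence_features("the fod was gret"))
```

Because the features are built from a set of words, repetition and broken grammar make no difference: only which words show up matters.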

Despite this, we still run into some feedback that is complete gibberish or does not refer to anything. Fortunately, since this is a multi-label classification problem, we can classify a piece of feedback as having no tag at all. Including some of these samples in our training data therefore helps fortify our engine against meaningless live data, since it learns to assign no tag to that kind of feedback.
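As a small illustration of the "no tag" case, assuming each tag gets its own independent score and a fixed threshold (the tag names, sample texts, and threshold here are hypothetical, not our production setup):

```python
TAGS = ["food_quality", "service", "price"]  # hypothetical tag set


def to_indicator(labels):
    """Encode a set of tags as a binary vector; no tags gives all zeros."""
    return [1 if tag in labels else 0 for tag in TAGS]


# Training samples; the gibberish examples carry an empty label set.
train = [
    ("food good, love love love", {"food_quality"}),
    ("slow service and overpriced", {"service", "price"}),
    ("asdf qwer zzz", set()),
    ("hmm ok", set()),
]
Y = [to_indicator(labels) for _, labels in train]


def decode(scores, threshold=0.5):
    """Keep each tag whose independent score clears the threshold."""
    return [tag for tag, s in zip(TAGS, scores) if s >= threshold]


print(Y)
print(decode([0.1, 0.2, 0.05]))  # no score clears the threshold
```

The point is that an all-zero target row is a perfectly valid training example in a multi-label setup, so the model has something to learn from when the input is meaningless.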

In our upcoming blog post about our "human in the loop" machine learning system, we also cover how we manually filter samples of data to make our training more efficient.