
by nl 1123 days ago

I'd encourage you to try plain old text classification on n-grams. An n-gram approach will pick up lemmas fine, although spaCy will do lemmatization if you prefer.
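To make that concrete, here is a rough sketch of what I mean by plain n-gram classification: word unigrams and bigrams fed into a small multinomial Naive Bayes. Everything in it (the `word_ngrams` helper, the `NgramNaiveBayes` class, the labels and the four toy sentences) is made up for illustration, and in practice you would use a library classifier rather than hand-rolling one.

```python
import math
from collections import Counter, defaultdict


def word_ngrams(text, n_min=1, n_max=2):
    """Return word n-grams of length n_min..n_max via a sliding window."""
    tokens = text.lower().split()
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


class NgramNaiveBayes:
    """Multinomial Naive Bayes over n-gram counts with add-one smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)       # documents per class
        self.gram_counts = defaultdict(Counter)   # n-gram counts per class
        self.vocab = set()
        for text, label in zip(texts, labels):
            grams = word_ngrams(text)
            self.gram_counts[label].update(grams)
            self.vocab.update(grams)
        return self

    def predict(self, text):
        # n-grams never seen in training are ignored
        grams = [g for g in word_ngrams(text) if g in self.vocab]
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            counts = self.gram_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(doc_count / total_docs)
            for g in grams:
                score += math.log((counts[g] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data, invented for this sketch.
train_texts = [
    "the king was beheaded at dawn",
    "they beheaded the prisoner",
    "the king was crowned at dawn",
    "they released the prisoner",
]
train_labels = ["violent", "violent", "other", "other"]

clf = NgramNaiveBayes().fit(train_texts, train_labels)
print(clf.predict("the rebels beheaded the general"))
print(clf.predict("the queen was crowned"))
```

Note there is no parser or lemmatizer anywhere in this: the surface n-grams carry the signal on their own.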

But I did a lot of work on this type of thing, and the only time I found this sentence-analysis approach useful as a source of classifier features was in a legal context, where there were variants of very specific language we wanted to find.

There it worked because we could write rules on the features without relying on training data.

TF-IDF on n-grams using a rolling window would certainly work to detect the beheading variants you gave as examples.
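A minimal sketch of that idea, assuming word n-grams up to length 3 and the common smoothed IDF formula ln((1 + N) / (1 + df)) + 1. The helper names and the toy corpus are hypothetical, and a library vectorizer (for example scikit-learn's `TfidfVectorizer` with its `ngram_range` parameter) would normally replace the hand-written parts.

```python
import math
from collections import Counter


def ngrams(text, n_max=3):
    """Slide a window of 1..n_max words over the text."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]


def fit_idf(docs):
    """Smoothed inverse document frequency for every n-gram in the corpus."""
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(ngrams(doc)))  # count each n-gram once per document
    return {g: math.log((1 + n_docs) / (1 + c)) + 1 for g, c in df.items()}


def tfidf_vector(text, idf):
    """L2-normalised TF-IDF vector; n-grams outside the corpus are dropped."""
    tf = Counter(g for g in ngrams(text) if g in idf)
    vec = {g: c * idf[g] for g, c in tf.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {g: w / norm for g, w in vec.items()} if norm else {}


def cosine(a, b):
    """Cosine similarity of two already-normalised sparse vectors."""
    return sum(w * b.get(g, 0.0) for g, w in a.items())


# Toy corpus, invented for this sketch.
docs = [
    "the hostage was beheaded by the group",
    "militants beheaded the hostage on video",
    "the hostage was released by the group",
    "the council voted on the budget",
]
idf = fit_idf(docs)
doc_vectors = [tfidf_vector(d, idf) for d in docs]

query = tfidf_vector("the group beheaded a hostage", idf)
scores = [cosine(query, v) for v in doc_vectors]
for doc, score in zip(docs, scores):
    print(f"{score:.3f}  {doc}")
```

The differently worded variants still share enough weighted n-grams with the query to score well above the unrelated sentence, which is the whole point: no parse is needed to catch them.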

Again: try without the parsing features. There's a good reason they are rarely used in classifiers: they are too unreliable to improve performance over simple approaches.