Hacker News
by nl 1126 days ago
Have you tried spaCy?

I find it substantially better than other tools as a PoS tagger.

Also worth noting that your assertion that you need these features to classify genres isn't obviously true to me at all.


No, I haven't. Thanks for the pointer.

For detecting uses of nouns like werewolf/werewolves, or vampire/vampires, I at least need the lemma to avoid writing different cases or a regex for each noun. Likewise, lemmatization can be used to handle different spellings (e.g. vampyre, or were-wolf). Similarly for verbs.

Lemmatization works best when it is coupled with part-of-speech tagging, so you avoid, for example, stripping the -ing from words that are not verb forms (such as the noun "morning").

Part-of-speech tagging also helps avoid incorrect labels, such as tagging 'bit' in "a bit is a single binary value" as the verb "to bite".
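
To make the 'bit' example concrete, here is a minimal sketch of a lemma lookup keyed on both the word and its part of speech. The table, the tag names (NOUN, VERB) and the `lemmatize` helper are made up for illustration; a real setup would take the tags from a tagger rather than pass them in by hand.

```python
# Toy lemma table keyed on (surface form, coarse POS tag).
# The entries and tag names are illustrative assumptions, not tagger output.
LEMMAS = {
    ("werewolves", "NOUN"): "werewolf",
    ("were-wolf", "NOUN"): "werewolf",
    ("vampires", "NOUN"): "vampire",
    ("vampyre", "NOUN"): "vampire",
    ("bit", "VERB"): "bite",
    ("bitten", "VERB"): "bite",
    ("bit", "NOUN"): "bit",
}


def lemmatize(token: str, pos: str) -> str:
    """Return the lemma for a (token, POS) pair.

    Falls back to the lowercased token when the pair is not in the table.
    """
    return LEMMAS.get((token.lower(), pos), token.lower())


print(lemmatize("bit", "VERB"))      # bite
print(lemmatize("bit", "NOUN"))      # bit
print(lemmatize("Vampyre", "NOUN"))  # vampire
```

Without the POS tag in the key, "bit" would have to map to a single lemma, and one of the two readings would always be wrong.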

That's for the simple case.

Then there are more complex cases, like generalizing "[NP] was bitten by the vampire.", where NP can be a personal pronoun (he, she, etc.) or a name. There can also be other ways to say the same thing, e.g. "The vampire bit [NP] neck." where NP is now the possessive form (his, her, etc.) not the subject form. With Universal Dependencies or similar style dependency relations, you could match and label sentence fragments of the form "verb=bite, nsubj:pass=NP, obl:agent=vampire" (like in the first sentence) and "verb=bite, nsubj=vampire, obj=neck, nmod:poss=NP" (like in the second sentence).
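
A rough sketch of that kind of matching, assuming the parser output has already been reduced to (head lemma, relation, dependent) triples with Universal Dependencies-style labels (nsubj, obj, nsubj:pass, obl:agent, nmod:poss). The `Dep` type, the `find_bite` helper and the two hand-written parses are hypothetical; a real parser's output would need converting into this shape first.

```python
from typing import List, NamedTuple, Optional, Tuple


class Dep(NamedTuple):
    head: str   # lemma of the governing word
    rel: str    # dependency relation label (UD-style, assumed)
    child: str  # lemma or surface form of the dependent word


def find_bite(deps: List[Dep]) -> Optional[Tuple[str, str]]:
    """Return (biter, victim) if the triples describe a bite, else None."""
    # Relations attached directly to the verb "bite".
    rels = {d.rel: d.child for d in deps if d.head == "bite"}
    # Possessors, so "her neck" can be traced back to "her".
    possessor = {d.head: d.child for d in deps if d.rel == "nmod:poss"}

    # Passive: "[NP] was bitten by the vampire."
    if "nsubj:pass" in rels and "obl:agent" in rels:
        return rels["obl:agent"], rels["nsubj:pass"]
    # Active: "The vampire bit [NP] neck."
    if "nsubj" in rels and "obj" in rels:
        obj = rels["obj"]
        return rels["nsubj"], possessor.get(obj, obj)
    return None


# Hand-written parses standing in for real parser output.
passive = [
    Dep("bite", "nsubj:pass", "she"),
    Dep("bite", "aux:pass", "be"),
    Dep("bite", "obl:agent", "vampire"),
]
active = [
    Dep("bite", "nsubj", "vampire"),
    Dep("bite", "obj", "neck"),
    Dep("neck", "nmod:poss", "her"),
]

print(find_bite(passive))  # ('vampire', 'she')
print(find_bite(active))   # ('vampire', 'her')
```

Both sentences normalize to the same (biter, victim) pair, which is the point of working on relations instead of surface word order.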

Without NLP, it becomes even harder to detect split variants like "cut off his head" and "cut his head off", which are the same thing written in different ways. I want to detect things like that and label the entire fragment "beheading", including other noun phrase variants.

With more advanced NLP features -- like coreference resolution (resolving instances of he/she/etc. to the same person), and information extraction (e.g. Dracula is a vampire) -- it would be possible to tag even more sentences and sentence fragments.

I'd encourage you to try plain old text classification on ngrams. An ngram approach will pick up lemmas fine, although spaCy will do lemmatization if you prefer.

But I've done a lot of work on this type of thing, and the only time I found sentence-analysis features useful in a classifier was in a legal context, where there were variants of very specific language we wanted to find.

There it worked because we could write rules on the features without relying on training data.

Tf-idf on ngrams using a rolling window would certainly work to detect the beheading variants you gave as examples.
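
One way to read "rolling window" here, as a sketch: emit every unordered pair of words that fall within a few tokens of each other, and use those pairs as the features to weight. The `window_pairs` name and the window size of 4 are assumptions for illustration, and tokenization is a plain whitespace split with no punctuation handling.

```python
from typing import Set, Tuple


def window_pairs(text: str, window: int = 4) -> Set[Tuple[str, ...]]:
    """Collect unordered word pairs that co-occur within `window` tokens."""
    tokens = text.lower().split()
    pairs = set()
    for i, left in enumerate(tokens):
        # Pair each token with the next (window - 1) tokens.
        for right in tokens[i + 1 : i + window]:
            # Sort so that word order inside the window does not matter.
            pairs.add(tuple(sorted((left, right))))
    return pairs


a = window_pairs("cut off his head")
b = window_pairs("cut his head off")
print(a == b)              # True
print(("cut", "off") in a) # True
```

With these features the two word orders are indistinguishable, so any weighting applied on top of them treats both variants the same way.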

Again: try without the parsing features. There's a good reason they are rarely used in classifiers: they are too unreliable to improve performance over simple approaches.

I don't see why a simple TF-IDF, at ~10 LoC and a few minutes of work, wouldn't get this at least crudely done.
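
For reference, a crude dependency-free sketch of what that could look like: TF-IDF over word unigrams and bigrams, then picking the label of the most similar training text by cosine similarity. It runs longer than 10 lines only because it avoids libraries. The four training sentences, the genre labels and all function names are made up for illustration, and the smoothed IDF formula is one common choice among several.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def ngrams(text: str, n_max: int = 2) -> List[str]:
    """Word n-grams for n = 1..n_max, using a plain whitespace split."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]


def fit_idf(docs: List[str]) -> Dict[str, float]:
    """Smoothed inverse document frequency for every n-gram seen in docs."""
    df = Counter(term for doc in docs for term in set(ngrams(doc)))
    return {t: math.log((1 + len(docs)) / (1 + c)) + 1.0 for t, c in df.items()}


def tfidf(text: str, idf: Dict[str, float]) -> Dict[str, float]:
    """L2-normalized TF-IDF vector; n-grams unseen in training are ignored."""
    tf = Counter(t for t in ngrams(text) if t in idf)
    vec = {t: c * idf[t] for t, c in tf.items()}
    norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
    return {t: w / norm for t, w in vec.items()}


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Dot product of two already-normalized sparse vectors."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def classify(text: str, train: List[Tuple[str, str]], idf: Dict[str, float]) -> str:
    """Return the label of the training text most similar to `text`."""
    query = tfidf(text, idf)
    best = max(train, key=lambda pair: cosine(query, tfidf(pair[0], idf)))
    return best[1]


# Hypothetical toy corpus; a real one would be whole books or chapters.
TRAIN = [
    ("the vampire bit her neck and drank her blood", "horror"),
    ("the werewolf howled at the full moon", "horror"),
    ("the detective examined the fingerprints at the crime scene", "crime"),
    ("the inspector questioned the suspect about the robbery", "crime"),
]
IDF = fit_idf([text for text, _ in TRAIN])

print(classify("a vampire drank blood under the moon", TRAIN, IDF))
print(classify("the suspect left fingerprints at the scene", TRAIN, IDF))
```

Nearest-neighbour lookup is used here only to keep the sketch short; any linear classifier over the same vectors would be the more usual choice.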