by syllogism 4323 days ago
I didn't think of character n-grams; that's a case where, yeah, you do want larger n. Same with bioinformatics.

But as far as word n-grams go, I've been doing NLP research for over ten years, and you almost never want 4- or 5-grams, let alone n-grams of greater length. The data's simply too sparse to be useful. So it's really a matter of generating bigrams and generating trigrams, which I think it's reasonable to have separate functions for.
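
To make the sparsity point concrete, here's a rough sketch in Python. The names (`ngrams`, `bigrams`, `trigrams`, `singleton_fraction`) and the toy sentence are made up for illustration and aren't from any particular library. It measures what fraction of the distinct n-grams in a text occur only once; on real corpora the exact numbers will differ, but the fraction tends to climb quickly as n grows.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bigrams(tokens):
    """Separate convenience function for the common n = 2 case."""
    return ngrams(tokens, 2)


def trigrams(tokens):
    """Separate convenience function for the common n = 3 case."""
    return ngrams(tokens, 3)


def singleton_fraction(tokens, n):
    """Fraction of distinct n-grams that occur exactly once (a crude sparsity measure)."""
    counts = Counter(ngrams(tokens, n))
    if not counts:
        return 0.0
    return sum(1 for c in counts.values() if c == 1) / len(counts)


# Toy data, purely illustrative; a real corpus would be far larger.
tokens = "the cat sat on the mat and the cat sat on the rug".split()

for n in (2, 3, 5):
    print(n, singleton_fraction(tokens, n))
```

Even on this tiny example the share of n-grams seen only once goes up with n, which is the reason counts for 4-grams and beyond are rarely reliable for words.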