by soheil 1731 days ago
Are the n-grams always at most n=2 bigrams?

No, I actually count the n-grams as distinct words (up to 4-grams). The main limiter for that is space, so I only extract "canned" n-grams from some tags.

I would first search for the bigram hello_world, which is an O(1) array lookup, and then for documents merely containing the words hello and world (usually not a good search result); that's the algorithm I'm describing in the parent comment.
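
For illustration, here is a rough sketch of that two-stage lookup: try the n-gram term first, then fall back to documents that contain each word separately. Everything in it is assumed rather than taken from the actual engine: the toy `INDEX`, the `search` and `intersect_sorted` names, and the use of a Python dict (a hash lookup standing in for the O(1) array lookup). The parent comment's algorithm is not shown here, so a plain sorted-list intersection is used as a stand-in.

```python
# Hypothetical toy index: term -> sorted list of document IDs.
# Multi-word n-grams are stored as single terms joined by "_".
INDEX = {
    "hello": [1, 2, 5, 9],
    "world": [2, 3, 5, 7],
    "hello_world": [5],
}


def intersect_sorted(a, b):
    """Return the IDs present in both sorted lists (two-pointer merge)."""
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def search(words, index=INDEX):
    """N-gram matches first, then documents merely containing all the words."""
    results = []
    if len(words) > 1:
        # Stage 1: a single lookup of the n-gram treated as one term.
        results.extend(index.get("_".join(words), []))

    # Stage 2: intersect the posting lists of the individual words.
    common = []
    postings = [index.get(w, []) for w in words]
    if postings:
        common = postings[0]
        for p in postings[1:]:
            common = intersect_sorted(common, p)

    # Append the weaker matches after the n-gram matches, skipping duplicates.
    seen = set(results)
    results.extend(doc for doc in common if doc not in seen)
    return results


print(search(["hello", "world"]))  # [5, 2]
```

Document 5 ranks first because it matched the bigram; document 2 only contains both words somewhere.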

Makes sense. Every time you insert a new URL for a word, do you have to update the ranges for every other word, since the URL file will be shifted?
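
To make the concern concrete, here is a small sketch of the layout the question seems to assume: one flat URL list with a (start, end) range per word. The thread does not confirm that layout, and `build_ranges` and `insert_url` are hypothetical names. Under that assumption, inserting a URL shifts the range of every word stored after the insertion point, though not the ones before it.

```python
def build_ranges(postings):
    """Concatenate per-word URL lists into one flat list.

    Returns (flat, ranges) where ranges[word] = (start, end) into flat.
    """
    flat, ranges = [], {}
    for word in sorted(postings):
        start = len(flat)
        flat.extend(postings[word])
        ranges[word] = (start, len(flat))
    return flat, ranges


def insert_url(flat, ranges, word, url):
    """Append url to word's range; return how many other ranges had to move."""
    start, end = ranges[word]
    flat.insert(end, url)
    ranges[word] = (start, end + 1)

    shifted = 0
    for other, (s, e) in list(ranges.items()):
        # Only ranges located at or after the insertion point move.
        if other != word and s >= end:
            ranges[other] = (s + 1, e + 1)
            shifted += 1
    return shifted


flat, ranges = build_ranges({"a": ["u1", "u2"], "b": ["u3"], "c": ["u4", "u5"]})
print(insert_url(flat, ranges, "a", "u9"))  # 2
print(ranges)  # {'a': (0, 3), 'b': (3, 4), 'c': (4, 6)}
```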