by robitor 4572 days ago
Thanks! I'm taking an information retrieval course right now and I'm interested in applying what I've learned to a cool pet project. I don't think we ever touched on n-gram models, for some reason.
2 comments

This isn't information retrieval. This is data processing. Information retrieval is a subset of data processing.

Retrieval specifically needs an algorithm to determine document relevance. Everything you're learning is to understand how different parts of that algorithm affect the results. It's a very difficult problem, even if you assume that the corpus isn't sapient.

Stuff like n-grams is more about reshuffling the data in order to expose patterns. It's a little bit like running a regression on some noisy data to see the underlying trend.
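
To make the "reshuffling" idea concrete, here's a rough sketch of counting word bigrams in Python. The `ngrams` helper and the sample sentence are made up for illustration, and splitting on whitespace is a simplifying assumption, not a proper tokenizer.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Toy input (assumption): whitespace tokenization of a made-up sentence.
text = "the cat sat on the mat and the cat slept"
tokens = text.split()

# Counting the bigrams is the "expose patterns" step: pairs that
# show up repeatedly float to the top.
bigram_counts = Counter(ngrams(tokens, 2))
top_bigram = bigram_counts.most_common(1)[0]
print(top_bigram)
```

On this input the only repeated pair is ("the", "cat"), so that's what comes out on top. Same data, just regrouped so the repetition is visible.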

I learned about n-grams in a natural language processing class in uni.

A related (but more interesting, imho) concept is the Hidden Markov Model, which is used for things like part-of-speech tagging and areas of speech recognition. It takes a sequence of "observations", like sound vectors or words, and uses a probabilistic model to match them with "hidden states", like phonemes or parts of speech.
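
If it helps, here's a minimal sketch of how the matching can work, using the Viterbi algorithm (the usual way to find the most likely hidden-state sequence in an HMM). The two-tag model and all of the probabilities below are invented for the example, not taken from any real tagger, and the function assumes a non-empty observation list.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely hidden-state sequence for the observations."""
    # best[t][s] = (probability of the best path ending in state s at step t,
    #               the previous state on that path)
    best = [{s: (start_p[s] * emit_p[s].get(observations[0], 0.0), None)
             for s in states}]
    for obs in observations[1:]:
        prev = best[-1]
        column = {}
        for s in states:
            # Pick the predecessor that maximizes the path probability into s.
            column[s] = max(
                (prev[p][0] * trans_p[p][s] * emit_p[s].get(obs, 0.0), p)
                for p in states
            )
        best.append(column)

    # Follow the back-pointers from the best final state.
    path = [max(states, key=lambda s: best[-1][s][0])]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return path

# Toy model (all numbers are assumptions for illustration).
states = ["NOUN", "VERB"]
start_p = {"NOUN": 0.8, "VERB": 0.2}
trans_p = {"NOUN": {"NOUN": 0.3, "VERB": 0.7},
           "VERB": {"NOUN": 0.6, "VERB": 0.4}}
emit_p = {"NOUN": {"dogs": 0.6, "bark": 0.1, "cats": 0.3},
          "VERB": {"dogs": 0.05, "bark": 0.7, "run": 0.25}}

tags = viterbi(["dogs", "bark"], states, start_p, trans_p, emit_p)
print(tags)
```

With these numbers, "dogs bark" comes out tagged NOUN then VERB. A real tagger would estimate the probabilities from a labeled corpus and work in log space to avoid underflow on long sentences.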

I got a job with a part-of-speech tagging website I made as a pet project for that class :)