| Why does it matter? You love seafood, so just literally run grep on the entire page and if it contains the word then include it as a correct. In reality, you will miss a lot of real seafood pages because they don't really need to mention "seafood" and context matters, so what? Chances are that that one website where person randomly added "I love seafood" to the top of the page will be the only page that you've ever wanted to see anyway. There's too much data for you to go through in entire life in any case, so why worry about it as long as you can get something that's good enough? You will never get best data, if it was possible, google would be giving you best data already. How do I know? Well, looking up my real name shows where I grew up, what school I went to, graduated, and even which exam I scored 100 on... And even some places I used to work for in the past, and while that part is going to make most people paranoid, I wish ALL results were as detailed as this one, but there's little you can do. |
Edit: https://en.wikipedia.org/wiki/Bag-of-words_model