by lrem
1421 days ago
I'm a Google engineer, organisationally far too removed from this to ever have any say in it. I wonder if that will ever be worth the hardware cost. Back when I did some coursework on information retrieval, it seemed that you get superlinear savings by reducing the cardinality of tokens. So you'd do stemming and remove all punctuation and words that are too frequent ("do", "be", "and", "or", ...). Basically, you remove all grammar. You apply the same processing to both the search query and the index. Intuitively, this reduces your compute by at least an order of magnitude, especially for languages with rich grammar (e.g. stemming nouns in Polish reduces the cardinality of tokens by a factor of 7, and verbs by a factor of 162).
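To make that concrete, here is a minimal sketch of that kind of normalization in Python, applied identically to the documents and to the query. Everything in it is an assumption for illustration: the stop-word list is a tiny hypothetical one, `toy_stem` is a crude suffix stripper standing in for a real stemmer, and the names `normalize`, `build_index` and `search` are made up here. It is not how any production search engine does it.

```python
import string
from collections import defaultdict

# Hypothetical stop-word list; real lists are longer and language-specific.
STOP_WORDS = {"do", "be", "and", "or", "the", "a", "is", "to", "of"}

# Toy suffix list (an assumption); a real system would use a proper stemmer.
SUFFIXES = ("ing", "ed", "es", "s")


def toy_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize(text):
    """Lowercase, replace punctuation with spaces, drop stop words, stem."""
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    words = text.lower().translate(table).split()
    return [toy_stem(w) for w in words if w not in STOP_WORDS]


def build_index(docs):
    """Map each normalized token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in normalize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every normalized query token."""
    tokens = normalize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


docs = {
    1: "The engineers indexed the documents.",
    2: "Indexing a document is cheap!",
    3: "Cats and dogs.",
}
index = build_index(docs)
matches = search(index, "indexes documents")  # matches docs 1 and 2
```

The point is that "indexed", "Indexing" and "indexes" all collapse into the single token `index`, so the index holds one posting list where it would otherwise hold three, and the query goes through the same `normalize` call, so it still matches.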