by Retr0id
1422 days ago
Interesting - looks like they're doing this via a bunch of special-case rules. To any Google engineers reading: please add a `really-verbatim` mode, indicated by backtick quotes, which also requires strict matching of punctuation.
I wonder if that will ever be worth the hardware cost. Back when I did some coursework on information retrieval, it seemed that you get superlinear savings by reducing the cardinality of tokens. So you'd stem every word, strip all punctuation, and drop words that are too frequent ("do", "be", "and", "or", ...). Basically, you remove all grammar. You apply the same normalization to both the search query and the index. Intuitively, this reduces your compute by at least an order of magnitude, especially for languages with rich grammar (e.g. stemming nouns in Polish reduces the cardinality of tokens by a factor of 7, and verbs by a factor of 162).
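To make that concrete, here's a rough Python sketch of the pipeline as I remember it from coursework, not how Google or any real engine does it. Everything in it is an assumption for illustration: the stop-word list is made up, `crude_stem` is a toy suffix stripper standing in for a real stemmer, and `normalize`, `build_index`, and `search` are hypothetical helper names.

```python
import re
from collections import defaultdict

# Illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"do", "be", "and", "or", "the", "a", "to", "of"}


def crude_stem(token):
    """Toy suffix stripper; a real system would use a proper stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip if at least 3 characters of stem remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def normalize(text):
    """Lowercase, drop punctuation, drop stop words, then stem."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


def build_index(docs):
    """Inverted index: normalized term -> set of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in normalize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """AND-query; the query goes through the same normalization as the index."""
    terms = normalize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


docs = {
    1: "Searching, indexed documents!",
    2: "The index searches documents.",
    3: "To be, or not to be?",
}
index = build_index(docs)

print(search(index, "search documents"))
print(search(index, "to be or"))
```

The first query matches documents 1 and 2 even though neither contains the literal word "search", because "Searching" and "searches" collapse to the same term. The second query normalizes to nothing at all, so it returns an empty set. That's the flip side: once punctuation and stop words are gone from the index, a strict verbatim mode has nothing left to match against unless you keep a second, unnormalized index around.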