by binarymax
3392 days ago
When documents are indexed (and when queries are run against them), the terms go through a process that splits them up and then normalizes them so they can be found more easily. A trivial example: you want to find "can't" when someone searches for "cant". Typically special characters are removed for several reasons: the vocabulary of terms becomes smaller, which saves space and time; you can ignore punctuation (like searching for things that abut parentheses); you can remove accent marks and diacritics; and a host of other things. Keeping special characters searchable is hard because (a) you need contextual awareness of punctuation in certain places, like the '&' character in john&jane vs. the logical &&, and (b) your vocabulary of terms becomes larger, which is probably not a big deal for most folks, but if you are Google then a 0.0001% increase in the vocab is a killer in space. --EDIT-- The vocab increase is probably not as much as I noted above, but even adding a dozen terms can have an impact at Google's scale.
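To make the normalization step concrete, here is a rough sketch of the kind of analysis I mean. The function names `normalize` and `analyze` are made up for illustration, and this is not how any particular search engine does it; it also assumes ASCII-only output, so non-Latin scripts would need something smarter.

```python
import re
import unicodedata


def normalize(term: str) -> str:
    # Decompose accented characters so each accent becomes a separate combining mark.
    decomposed = unicodedata.normalize("NFKD", term)
    # Drop the combining marks (accents, diacritics).
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Lowercase, then drop anything that is not an ASCII letter or digit.
    # (Assumption: ASCII-only vocabulary, which is a simplification.)
    return re.sub(r"[^a-z0-9]", "", no_marks.lower())


def analyze(text: str) -> list[str]:
    # Split on whitespace, normalize each piece, and discard pieces that end up empty.
    return [t for t in (normalize(piece) for piece in text.split()) if t]
```

With this, "can't" and "cant" both end up as the same indexed term, and "(café)" matches a search for "cafe". It also shows the contextual problem: "john&jane" collapses into one term and a standalone "&&" disappears entirely, so neither can be searched for as written.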
|
In that case, I think you want to just store "can't" and treat "cant" the way you would any other potential near-miss spelling of a more common word.