
by binarymax 3438 days ago

When indexing documents (and querying for them), the terms go through a process that splits them up and then normalizes them so they can be found more easily. A trivial example: you want to find "can't" when someone searches for "cant". Special characters are typically removed for several reasons: the vocabulary of terms becomes smaller, which saves space and time; you can ignore punctuation (like when searching for things that sit right up against a parenthesis); you can remove accent marks and diacritics; and a host of other things.
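To make the normalization step concrete, here is a minimal sketch in Python. The function name `normalize` and the exact rules (strip diacritics, drop apostrophes, split on everything else) are assumptions for illustration, not how any particular search engine does it.

```python
import re
import unicodedata

def normalize(text):
    """Split text into lowercase terms with punctuation and diacritics removed."""
    # Decompose accented characters so each accent becomes a separate combining mark.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks, leaving only the base letters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Remove apostrophes inside words so "can't" and "cant" become the same term.
    stripped = stripped.replace("'", "").replace("\u2019", "")
    # Treat every remaining run of non-alphanumeric characters as a separator.
    return re.findall(r"[a-z0-9]+", stripped.lower())

print(normalize("Can't find (café)"))  # ['cant', 'find', 'cafe']
```

The same function would be applied to the query, which is why a search for "cant" ends up matching a document containing "can't".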

Keeping special characters searchable is hard because (a) you need contextual awareness of punctuation in certain places, like the '&' character in john&jane vs. the logical &&, and (b) your vocabulary of terms becomes larger, which is probably not a big deal for most folks, but if you are Google then a 0.0001% increase in the vocab is a killer in space.

--EDIT-- The vocab increase is probably not as much as I noted above - but even adding a dozen terms can have an impact at Google's scale.


JoshTriplett 3438 days ago

> A trivial example is you want to find "can't" when someone searches for "cant".

In that case, I think you want to just store "can't" and treat "cant" the way you would any other potential near-miss spelling of a more common word.


chimprich 3438 days ago

OK, thanks - that sounds plausible on the face of it, but why wouldn't you store special characters and then ignore them when matching patterns? You could then make an exception for strings in quotes (or some other option for activating a more precise search).

Maybe Google hasn't previously thought the extra space/complexity was worth the special treatment, but given the quantity of data they already index and the usefulness of this feature, I'm surprised.


nostrademons 3438 days ago

[Ex-Googler, used to work on search; this issue came up repeatedly during my tenure.]

The storage cost was prohibitive. Search engines rely on a data structure known as an inverted index; it's basically a list, for each token, of every document that contains the token, and for a context-aware search engine like Google it usually contains the position of the token within the document as well. Single-character punctuation marks like periods, commas, parentheses, dashes, etc. appear in literally every sentence. That means the posting list for periods or commas would have to contain an entry for literally every single sentence on the web.
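A toy version of such a positional inverted index, to show why punctuation tokens are so expensive. This is only a sketch: `build_positional_index` and the whitespace tokenization are made up for illustration and say nothing about Google's actual implementation.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map each token to {doc_id: [positions]}, a toy positional inverted index."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # Assumption: tokens are already separated by whitespace.
        for position, token in enumerate(text.split()):
            index[token].setdefault(doc_id, []).append(position)
    return index

docs = {
    1: "hello , world .",
    2: "goodbye , world .",
}
index = build_positional_index(docs)

print(index["hello"])  # {1: [0]}
print(index["."])      # {1: [3], 2: [3]}
```

Even in this two-document corpus, the entries for "," and "." list every document, while an ordinary word like "hello" lists only the documents that use it.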

There's a similar problem for common words like 'a', 'the', prepositions, etc., but this is usually already handled by stopwording (leaving such words out of the index).

That's why this announcement only covers groups of punctuation with 2-3 characters. These don't appear in ordinary text, and so you can generate posting lists for them that are reasonably sized. (I suspect that the economics of the index have changed as well, making storage cheaper, but this work happened after I left and so I don't know details.)


Buge 3438 days ago

You need to double the size of the index. You now need an index with punctuation and without punctuation.

Previously if a document contained "(hello" it would just be stored in the index once: as "hello". With this change, it needs to be stored in the index twice, as "(hello" and "hello", so that people searching either term can find it.
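A rough sketch of what indexing both forms could look like. The helper names `index_forms` and `build_index` are hypothetical, and stripping punctuation with `str.strip` is just one simple assumption about how the bare form is derived.

```python
import string
from collections import defaultdict

def index_forms(token):
    """Return every form of a raw token that should get a posting."""
    raw = token.lower()
    # Bare form: the token with leading and trailing punctuation removed.
    bare = raw.strip(string.punctuation)
    forms = {raw}
    if bare:
        forms.add(bare)
    return forms

def build_index(docs):
    """Map each indexed form to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            for form in index_forms(token):
                index[form].add(doc_id)
    return index

docs = {1: "(hello world", 2: "hello again"}
index = build_index(docs)

print(sorted(index["hello"]))   # [1, 2]
print(sorted(index["(hello"]))  # [1]
```

In this toy version a token gets a second posting only when it actually carries punctuation, so how much the index grows depends on how often that happens in the corpus.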
