
[Ex-Googler; I used to work on search, and this issue came up repeatedly during my tenure.]

The storage cost was prohibitive. Search engines rely on a data structure known as an inverted index: for each token, it's basically a list (a "posting list") of every document that contains that token, and for a context-aware search engine like Google it usually also stores the token's position within the document. Single-character punctuation marks like periods, commas, parentheses, and dashes appear in nearly every sentence. That means the posting list for periods or commas would need an entry for essentially every sentence on the web.
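To make that concrete, here's a toy positional inverted index in Python. It's only a sketch of the general idea, not how any production search engine does it; the tokenizer regex and the names `build_positional_index` and `posting_size` are made up for illustration.

```python
import re
from collections import defaultdict

def build_positional_index(docs):
    """Map each token to {doc_id: [positions of that token in the doc]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        # Assumption: words and single punctuation marks are separate tokens.
        tokens = re.findall(r"\w+|[^\w\s]", text)
        for pos, tok in enumerate(tokens):
            index[tok.lower()][doc_id].append(pos)
    return index

def posting_size(index, token):
    """Total number of (doc, position) entries stored for a token."""
    return sum(len(positions) for positions in index.get(token, {}).values())

docs = {
    1: "Search is hard. Indexing is harder.",
    2: "Punctuation, like commas, is everywhere.",
}
index = build_positional_index(docs)

for token in (".", ",", "hard"):
    print(repr(token), posting_size(index, token))
```

Even in two tiny documents, the period picks up one entry per sentence while a content word like "hard" has a single entry. Scale that to the whole web and the posting list for "." is about as long as the list of sentences.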

There's a similar problem for common words like 'a', 'the', prepositions, etc., but that is usually already handled by stopwording (dropping very common words from the index).

That's why this announcement only covers punctuation sequences of 2-3 characters. These rarely appear in ordinary text, so you can generate posting lists for them that are reasonably sized. (I suspect that the economics of the index have changed as well, making storage cheaper, but this work happened after I left and so I don't know the details.)
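As a rough illustration of that cutoff, here's a sketch that keeps only punctuation runs of 2-3 characters as indexable tokens. The regex and the function name `indexable_punctuation` are my own assumptions; I don't know how the real tokenizer decides this.

```python
import re

# Assumption: a "punctuation run" is any stretch of characters that are
# neither word characters nor whitespace.
PUNCT_RUN = re.compile(r"[^\w\s]+")

def indexable_punctuation(text):
    """Return punctuation runs that are 2 or 3 characters long."""
    return [
        m.group()
        for m in PUNCT_RUN.finditer(text)
        if 2 <= len(m.group()) <= 3
    ]

print(indexable_punctuation("if a == b then x += 1 else y -> z."))
# prints ['==', '+=', '->']
```

The lone period at the end is skipped, so it never gets a posting list. One limitation of this sketch: adjacent marks get merged, so something like "++," would be treated as a single three-character run.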