Hacker News new | ask | show | jobs
by chimprich 3400 days ago
OK, thanks - that sounds plausible on the face of it, but why wouldn't you store special characters and then ignore them when matching patterns? You could then make an exception for strings in quotes (or some other option for activating a more precise search).

Maybe Google hasn't previously thought the extra space/complexity was worth the special treatment but given the relative quantity of data they already index and the usefulness of this feature I'm surprised.

2 comments

[ex-Googler, used to work on search, this issue came up repeatedly during my tenure then].

The storage cost was prohibitive. Search engines rely on a data structure known as an inverted index; it's basically a list, for each token, of every document that contains the token, and for a context-aware search engine like Google it usually contains the position within the document of the token as well. Single-character punctuation marks like periods, commas, parentheses, dashes etc. appear in literally every sentence. That means that the inverted index for periods or commas would have to contain an entry for literally every single sentence on the web.

There's a similar problem for common words like 'a', 'the', prepositions, etc, but these are usually already solved by stopwording.

That's why this announcement only covers groups of punctuation with 2-3 characters. These don't appear in ordinary text, and so you can generate posting lists for them that are reasonably-sized. (I suspect that the economics of the index have changed as well, making storage costs cheaper, but this work happened after I left and so I don't know details.)

You need to double the size of the index. You now need an index with punctuation and without punctuation.

Previously if a document contained "(hello" it would just be stored in the index once: as "hello". With this change, it needs to be stored in the index twice, as "(hello" and "hello", so that people searching either term can find it.