Hacker News
by 8192kjshad09- 1421 days ago
Quoted search is provably broken for some queries. Try a Google search for "[::]" (with the double quotes): it returns no results. Similarly, try a search for 'linux next hop "[::]"' (with the double quotes): none of the results will contain [::].

Proof: https://archive.ph/AAa6k and https://archive.ph/9WGe7

3 comments

> none of the results will contain

More and more frequently I was getting this for the actual search terms, in quotes or not, to the point where I would Control+F any words just to find none of them existed on the page. It's the reason I switched to dumber search engines.

I've assumed the "fast path" is to search for "phrases with similar meaning", rather than actual words. But that really destroys technical searches.

If you haven't read our post, I'd encourage doing so. Quotes do work to find the exact terms specified. But control-F won't locate some of the terms we find when fully rendering a doc -- that's why the list explains using developer tools to search if control-F comes up with nothing.
It might be useful to offer people a way to search for content that is rendered in the page, rather than content that is only visible in developer tools.
The content is rendered on the page. For example, say someone has an email sign-up box. When the page renders, the box appears and it might list all the countries in the world, so that you can pick your country from the list. All those countries are rendered, available if you use the box. But if you ctrl-f search, you might not see that text even though it did render. Real case I looked into which prompted the tip of using developer tools.
This isn't evidence of anything changing. Google has always ignored punctuation, treating it as whitespace, as mentioned in the article.
Interesting - looks like they're doing this via a bunch of special-case rules.

To any Google engineers reading:

Please add `really-verbatim` mode, indicated by backtick quotes, which also requires strict matching of punctuation.

I'm a Google engineer, way too far away organisationally to ever have any say in this.

I wonder if that will ever be worth the hardware cost. Back when I did some coursework on information retrieval, it seemed that you get superlinear savings by reducing the cardinality of tokens. So you'd do stemming, remove all punctuation, drop words that are too frequent ("do", "be", "and", "or", ...)... Basically remove all grammar. You do the same to your search query and the index. This intuitively reduces your compute by at least an order of magnitude, especially for languages with rich grammar (e.g. stemming nouns in Polish reduces the cardinality of tokens by a factor of 7 and verbs by a factor of 162).
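
A rough sketch of that kind of normalization, in Python. The stop-word list and the suffix-stripping "stemmer" are toy assumptions for illustration, not what any real search engine uses; the point is only that the query and the indexed text go through the same function, and that punctuation disappears along the way.

```python
import re

# Toy stop-word list (an assumption for illustration only).
STOP_WORDS = {"do", "be", "and", "or", "the", "a", "to", "of"}

# Crude suffix list standing in for a real stemmer (hypothetical).
SUFFIXES = ("ing", "ed", "es", "s")


def crude_stem(word):
    for suffix in SUFFIXES:
        # Keep at least three characters so short words survive.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize(text):
    # Treat every non-alphanumeric character as whitespace, then lowercase.
    words = re.sub(r"[^0-9A-Za-z]+", " ", text).lower().split()
    # Drop stop words, then stem what is left.
    return [crude_stem(w) for w in words if w not in STOP_WORDS]


print(normalize("Searching the indexes, and indexing searches!"))
# ['search', 'index', 'index', 'search']
print(normalize('"[::]"'))
# []
```

Six words collapse to two distinct tokens, and a punctuation-only query collapses to no tokens at all.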

No way they'll inflate their indexes even 20% and add complexity into their algorithms for 0.1% of queries that won't bring any additional income.
They don't necessarily have to inflate their indexes. Backtick-quoted results ought to be a subset of double-quoted results, so they can use the standard quoted search algorithm, and then filter out imperfect matches from those results.
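
The two-stage idea could look roughly like this in Python. Everything here is hypothetical: `loose_phrase_match` stands in for the existing punctuation-blind quoted search, and `strict_filter` is the proposed extra pass. It assumes the raw page text is available for each candidate.

```python
import re


def tokens(text):
    # Punctuation-blind tokenization: non-alphanumerics act as whitespace.
    return re.sub(r"[^0-9A-Za-z]+", " ", text).lower().split()


def loose_phrase_match(phrase, docs):
    # Stand-in for the ordinary double-quoted search (hypothetical):
    # a doc matches if the phrase's tokens appear consecutively in it.
    wanted = tokens(phrase)
    n = len(wanted)
    hits = []
    for doc in docs:
        have = tokens(doc)
        if n and any(have[i:i + n] == wanted for i in range(len(have) - n + 1)):
            hits.append(doc)
    return hits


def strict_filter(phrase, candidates):
    # Proposed second stage: keep only candidates that contain the phrase
    # character for character, punctuation included.
    return [doc for doc in candidates if phrase in doc]


docs = [
    "set the next hop to fe80::1 on eth0",
    "the next-hop address is optional",
    "next hop: see the routing table",
]

candidates = loose_phrase_match("next-hop", docs)
print(len(candidates))                        # 3
print(strict_filter("next-hop", candidates))  # ['the next-hop address is optional']
```

In this sketch a punctuation-only phrase such as "[::]" yields no tokens in the first stage, so there is no candidate set left to filter.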
Google searches ignore punctuation, so it's not even indexed. There's no way to search for punctuation without inflating the index.
I work for Google Search. We did look at this, and we'll keep looking to see if we can improve, but it turns out to be a very hard lift.
The post explains that we see some punctuation as spaces, so that query is a search for nothing, which is why it fails.