by IgorPartola 1553 days ago
I still don’t get why. It cannot be so difficult for them to keep things like literal search, can it? What is the incentive to remove it and replace it with a needlessly more complex almost literal but still fuzzy search?

I do suspect the main thing people complain about currently with Google is the abundance of ads and the algorithm that has encouraged stupid amounts of articles of a certain length. Recipe for baked potatoes is now 2000 words long.


> It cannot be so difficult for them to keep things like literal search, can it?

Greater scale = greater cost of keeping data hot in their search data-warehouses (esp. in light of contention over memory/caches.) Keeping around both a source-text string and its tsvector representation (or whatever Google's version of that is) is a "thing that doesn't scale" that they could provide at 1B queries/day, but probably not at 10B queries/day.

> the algorithm that has encouraged stupid amounts of articles of a certain length. Recipe for baked potatoes is now 2000 words long.

That's not the algorithm's fault per se; that's instead the fact that recipes can't be copyrighted, and so these sites can freely steal + repost one-another's recipes, and so you'll find the same recipe word-for-word on many sites, thus making an exact match in the recipe part not contribute highly to ranking any particular site. The 2000-word blog post, on the other hand, is actual Intellectual Property unique to the site posting it. So it only appears in the one place; and so when your query matches it, it ranks quite highly indeed.

> That's not the algorithm's fault per se;

Yes, it is. There are good recipe sites out there with authoritative, reliable content and fast loading times. Google says it prioritizes those things, I can identify sites that have them, and yet the algorithm doesn't favour them. That's the algorithm's fault no matter what memes about copyright law cause a proliferation of shitty websites.

What I'm saying is that the "recipe" part of a recipe website is a commodity – there is no "authoritative" source for a given recipe, unless that recipe is too niche in appeal to end up widely disseminated. This video (https://www.youtube.com/watch?v=SsNLzyqqINw) has a pretty good coverage of the topic.

Compare and contrast: phone-number directory listings. Who should Google cite as the authoritative source for lists of name-to-phone number associations? Nobody. All the lists are copying from each-other, curating and correcting the data taken from one-another, gathering their own original data for additions, and everything in between. Every portal overlaps every other portal, but mostly has the same stuff.

Compare and contrast, in the physical world: printings of public-domain literature. If Google indexed bookstores, which printing by which publisher would you want them to rank first on a search for e.g. Pride and Prejudice?

Try Kagi.com, you can rank domains however you want
What I really want is biased search results of my choosing.

$10 a month for a personal search is a bit much. $10 a month for work related search is cheap. Give me results specific to my industry without having a super long query.

That's what Kagi lenses are for. Just try...
(Neeva team member here) re: recipes. You might like the Neeva recipe search experience. You can see an entire recipe and reviews (without the ads or intro text) without navigating away from the search results page. Quick example here: https://neeva.com/search?q=baked+potato&src=nvobar
The last time this came up, Google demonstrated that it still worked. Most of the examples people tried to provide of it not working are actually just unexpected exact matches in the HTML that the standard user doesn't see, so they seem like false positives or "surprisingly good" results not based on the page content.
Approximate match allows you to sell approximate/related-match ads
Exactly. Even within the ad platform they keep pushing advertisers to target ‘broad’ keywords instead of ‘exact match’ ones.
Brilliant observation, never thought of it. Now the quality degradation kinda starts to make sense.
> What is the incentive to remove it and replace it with a needlessly more complex almost literal but still fuzzy search?

Control. They've moved from helping you find what you asked for to trying to influence you into changing what you ask for to the thing that paid them the most.

Similarly, they're forcing creators to alter content to match their metrics or fall into obscurity.

Because somebody could have crafted a superior search using their refined search as an API, destroying the ad revenue.
They didn’t remove literal search. Put your literal in quotes.
This stopped working reliably some time before last year.
Eh, it's not so much that it stopped working as it is that it never worked the way you thought it did.

Quotes have ~always been an exact match on the tokenized query text, not a substring match on the corpus text. No synonyms, reordering, gaps, etc, but the matches -- and failures -- are sometimes not obvious at first blush.

If you search for "don't stop me now", for instance, that "don't" tokenizes to "don t", so it will match the tokenized strings "don't", "don t", "don-t", "don, t", etc ... but not "dont", because that's outside tokenization.

On the other hand, snippets mostly are substring matches of the query text, so if you see a result to a literal query that doesn't have a snippet, you know it's probably one of the weird matches.
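
To make the "don't stop me now" case above concrete, here is a rough sketch of phrase matching on tokens rather than on raw substrings. The `tokenize` and `phrase_match` helpers are hypothetical names, and the split-on-non-alphanumerics rule is only an assumption for illustration; Google's real tokenizer is not public and is surely more involved.

```python
import re


def tokenize(text):
    # Assumed rule: lowercase, then split on any run of characters
    # that is not an ASCII letter or digit. Empty pieces are dropped.
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def phrase_match(query, document):
    # True when the query's tokens appear in the document's tokens
    # as one contiguous run, in the same order (no gaps, no reordering).
    q = tokenize(query)
    d = tokenize(document)
    n = len(q)
    if n == 0:
        return False
    return any(d[i:i + n] == q for i in range(len(d) - n + 1))


query = "don't stop me now"
for doc in ["Don't stop me now", "don-t stop me now", "don, t stop me now",
            "dont stop me now", "don't ever stop me now"]:
    print(repr(doc), phrase_match(query, doc))
```

Under this assumed tokenizer, the first three documents match because they all tokenize to `don t stop me now`, while "dont" stays a single token and the inserted "ever" breaks the contiguous run.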

This is just patently false in addition to being condescending.

If you use quotes around a phrase, it will reorder terms and make substitutions with synonyms in addition to straight up ignoring the quoted phrase no matter how many times you add +. If you then fiddle with settings (randomly not available depending on star alignment and device) to change it to 'verbatim' it will still reorder and split up tokens in the phrase.

Why is it that exact searches used to actually work reliably, then? What's changed?
This has changed:

  term -term
In the past, that Google query returned no results for any term. Today the set of results is altered before it is presented to the user. Sometimes the set of results is empty and at other times it contains results.

For example:

  steve -steve    returns 0 results
  test -test      returns 4,780,000,000 in my search
                  starting with google/youtube videos
That's easy; there are synonyms added for test but not for steve.

If you search for ["test" -test] you'll get no results; quoting "test" removes the synonyms.

It's probably not new behavior per se, but synonyms have gotten a lot broader over the years, so it was a lot easier to punch [term -term] ten years ago and hit a term which had no synonyms.
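
A toy model of the [term -term] behavior described above, under loud assumptions: the `SYNONYMS` table, the `expand` and `search` helpers, and the four sample documents are all made up for illustration, and I'm assuming the exclusion applies only to the literal term while the inclusion is widened by synonyms.

```python
# Hypothetical synonym table: "test" has synonyms, "steve" has none.
SYNONYMS = {"test": {"tests", "testing", "exam"}}


def expand(term, quoted=False):
    # A quoted term matches only itself; an unquoted term also
    # matches its (assumed) synonyms.
    if quoted:
        return {term}
    return {term} | SYNONYMS.get(term, set())


def search(docs, include, exclude, quoted=False):
    # Keep documents containing the included term (or a synonym)
    # that do not contain the literal excluded term.
    wanted = expand(include, quoted)
    results = []
    for doc in docs:
        words = set(doc.lower().split())
        if words & wanted and exclude not in words:
            results.append(doc)
    return results


docs = ["steve jobs", "unit testing guide", "test plan", "steve took a test"]

print(search(docs, "steve", "steve"))             # no synonyms to fall back on
print(search(docs, "test", "test"))               # a synonym-only match survives
print(search(docs, "test", "test", quoted=True))  # quoting removes the synonyms
```

In this sketch only "unit testing guide" survives [test -test], because it matches via the synonym "testing" without containing the literal word "test".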

I mean, I'm confident that the core of how it works -- "an exact match between the tokenized document and the tokenized query" -- hasn't changed in a very long time, but I can't really promise there wasn't another aspect I'm ignorant of that is responsible for the behavior you remember that changed somehow.

"Exact tokenized matching" can look like "exact string matching" a lot of the time. Until you hit some of the edge cases it's like kerning: https://xkcd.com/1015/