by ChuckMcM 4092 days ago
Sigh, this is incorrect.

edit: incorrect is perhaps too strong, it is incomplete.

While it is true that click tracking can be used as a relevance signal, the people who were really pissed off when the data stream got dumped were advertisers who wanted to buy AdWords. That was a very simple system: pay someone for clickstream data, extract trending queries, front those with AdWords buys to get your page to the top of Google's results, and profit.

Having built a search engine and run it for 5 years, we got to see, in a very loose way, what people felt was relevant and what wasn't from click stream data. Basically you have a query and 10 blue links; you can split the results into quartiles and figure out whether the thing they clicked on was in the top half, bottom half, top quarter, second quarter, etc., and do A/B testing to see how that played out. But what we found was that the best indication of what a page was about was the text that linked to it. If you have an in-link to a page which was "<href='page'>great radio site"[1], then "great radio site" would be a query that should return that page, which might be titled something like "bob's electromagnetic spectrum imaginarium" or something equally unlikely to come up in a query string.
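To make the two signals above concrete, here is a rough Python sketch, not what we actually ran. All function names and the example URLs are made up for illustration. It buckets a clicked position among 10 results into quartiles (10 does not divide evenly by 4, so the buckets hold 3/2/3/2 positions), and it builds a toy index from anchor text to the pages that text links to.

```python
from collections import Counter, defaultdict

def click_bucket(position, n_results=10, n_buckets=4):
    """Map a 1-based clicked position to a bucket (1 = top of the page)."""
    if not 1 <= position <= n_results:
        raise ValueError("position out of range")
    return (position - 1) * n_buckets // n_results + 1

def bucket_histogram(clicked_positions, n_results=10, n_buckets=4):
    """Count clicks per bucket, e.g. for one arm of an A/B test."""
    counts = Counter(click_bucket(p, n_results, n_buckets) for p in clicked_positions)
    return [counts.get(b, 0) for b in range(1, n_buckets + 1)]

def build_anchor_index(links):
    """links: iterable of (anchor_text, target_url) pairs.

    Returns a mapping from normalized anchor text to a Counter of target URLs.
    """
    index = defaultdict(Counter)
    for anchor_text, target in links:
        index[anchor_text.strip().lower()][target] += 1
    return index

def lookup(index, query):
    """Target URLs ordered by how many in-links used the query as anchor text."""
    return [url for url, _ in index.get(query.strip().lower(), Counter()).most_common()]

# Hypothetical in-links; the page title never has to contain the query words.
links = [
    ("great radio site", "http://bobs-imaginarium.example/"),
    ("Great Radio Site", "http://bobs-imaginarium.example/"),
    ("great radio site", "http://other-radio.example/"),
]
index = build_anchor_index(links)
print(lookup(index, "great radio site"))
print(bucket_histogram([1, 2, 4, 9, 10]))
```

Passing n_buckets=2 gives the top half / bottom half split instead of quartiles. A real engine would tokenize the anchor text and weight the linking pages rather than do an exact-match lookup.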

So the bottom line is that there are lots of ways to try to determine relevance; click stream data is part of that, but by no means the biggest factor.

[1] neutered html for obvious reasons.

3 comments

The value of looking at queries is that it allows learning what questions users ask. The front end of the search process is to infer from the query what the user really should be given. That's a machine learning problem. The head of Google search remarked recently that "as the search engine gets smarter, the queries get dumber".

This is reflected in Google's search results. A Google query which can possibly be interpreted as related to a popular culture item usually will be. Google has become more aggressive about this over the years. Their "Did you mean" result tag once offered an alternative for a second search. Now, they return results for the more popular interpretation first.

The back side of search, page quality and ranking, is weaker than many think. Links are less useful than they used to be. Most links to business sites are now from "social" sites or forums, which are easily spammed. Using social signals was a disaster back in 2012, when, for a few months, Google went all-in on social signals. Google tried to recognize sites that "look like spam", but everybody knows that now and spam sites look better than ever. (The same thing happened with spam emails a decade ago.) Google doesn't recognize provenance, so they can be fooled by scraper sites. Google doesn't recognize the business behind the web page, so they can be fooled by marginal businesses. There are even SEO companies using machine learning to reverse engineer Google's algorithms, to find out how far they can go with keyword stuffing before a penalty kicks in.

Google does far more manual adjustment than they did two years ago. There's an army of people doing manual ranking, and a smaller unit handling appeals from manual penalties. There was a time when Google boasted they did no manual adjustments to ranking. The automation is starting to fail.

1noon (Korean web search startup) tried to recognize provenance and was somewhat successful. But that wasn't enough to win in the market. Naver acquired 1noon.
But where's the competitive ecosystem in search? Innovation in search is restricted to a few hundred people in Mountain View. And that's a tragedy.

Google has done for search the opposite of what it did for innovation in smartphones/tablets/browsers.

Chuck, while blekko is a great search engine (especially due to custom search), it's clear that it is very different quality-wise from Google. Same for Bing: it's not up to Google's level. And not for lack of trying or money (Bing).

So how do you think Google is succeeding so well, if it's not click stream data? And why couldn't it be a combination of things that depends strongly on click stream data that others couldn't copy?

Actually, if you do double blind tests you will find that Bing and Google are indistinguishable. We did this at Blekko earlier with our "3 card monte" gambit, where you did a query, got back blekko, bing and google results, and got to pick the one with the "best" results for your query. Blekko usually won if it was a query we had a slashtag for or if it was a "highly contested" query (lots of ad spend, like "no fee credit card" or "cheapest insurance"). In the former case our curation meant that more results were appropriate, and in the latter case our spam filtration left us with better results. If it was a general query for which we didn't have a category, and it wasn't highly contested, google and bing split the results, often 40/40/20, sometimes as low as 35/35/30. And if it was a long tail query like "turnip growing in south philadelphia" or something very specific with few sites associated with it, and we didn't have it in a slashtag, Google would "win" those. Microsoft borrowed our idea and did their whole "bing and decide" campaign.
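For anyone who wants to try this kind of blind comparison themselves, a minimal Python sketch follows. It is not Blekko's code; the function names and the `pick` callback (which stands in for the human choosing a result set) are assumptions. It shuffles the engines so the chooser cannot tell which is which, records the winner, and turns the tallies into shares like the 40/40/20 split above.

```python
import random
from collections import Counter

def blind_trial(results_by_engine, pick, rng=random):
    """Show unlabeled result sets in random order and return the winning engine.

    results_by_engine: dict mapping engine name -> list of results for one query.
    pick: callable that takes the list of shown result sets and returns the
          index of the preferred one (stands in for the human judge).
    """
    engines = list(results_by_engine)
    rng.shuffle(engines)                      # hide which engine is which
    shown = [results_by_engine[e] for e in engines]
    return engines[pick(shown)]

def win_shares(winners):
    """Turn a list of winning engine names into a fraction per engine."""
    counts = Counter(winners)
    total = sum(counts.values())
    return {engine: n / total for engine, n in counts.items()}

# Hypothetical outcome of five trials.
print(win_shares(["google", "google", "blekko", "bing", "bing"]))
```

Segmenting the winners by query type (curated, highly contested, long tail) before calling `win_shares` is what makes the differences described above visible.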

Many people realize that if you put Google's ads on Bing's results and Bing's ads on Google's results, the profitability would switch (not that I am entirely sure what that says, other than that having a credible search engine and top end ad inventory is required to make excess money in search).

It will be interesting to see if Marissa gets back into the game with Yahoo when their agreement to use Bing results for Yahoo searches expires.

The interesting linkage is that you can't sell search advertising unless people send the search request to you, and if you're not the most common place that people search, you're unlikely to get first shot at advertising. You can "buy" traffic (that is called Paid Distribution) by putting your search box on people's web sites, or causing someone's browser to send you search queries first, or paying a phone maker to send you all their search queries, but you have to make enough money from the ads to offset what you pay. And as I mentioned, over the last 8 years Google has been paying more and more for their traffic (up to $968M last quarter), and very few entrants into the business are going to compete with that. If you already have a platform (like Mozilla has Firefox, Apple has the iPhone, Facebook has pretty much everyone's Facebook page) so you "own" the ingress point, you can leverage that with a good search engine to make a lot of revenue. But if you need to pay for access to the ingress point, and pay a big chunk to the ad provider, it is really hard to support a lot of infrastructure (whose cost is proportional to index size). That is the constraint box of search today.
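The constraint box can be written down as simple per-query arithmetic. The Python sketch below is only an illustration: the function name, the parameters and every number in the example are hypothetical, not figures from any real engine.

```python
def net_per_query(ad_revenue, ad_provider_share, distribution_cost, infra_cost):
    """Money left per query after paying everyone else.

    ad_revenue:        gross ad revenue earned per query
    ad_provider_share: fraction of that revenue kept by the ad provider (0..1)
    distribution_cost: what you pay per query for the ingress point
    infra_cost:        per-query share of crawling, indexing and serving costs
    """
    kept = ad_revenue * (1 - ad_provider_share)
    return kept - distribution_cost - infra_cost

# Hypothetical entrant that rents both its ad inventory and its traffic.
entrant = net_per_query(ad_revenue=0.030, ad_provider_share=0.30,
                        distribution_cost=0.015, infra_cost=0.004)

# Hypothetical platform owner: same engine, but it owns the ingress point.
platform = net_per_query(ad_revenue=0.030, ad_provider_share=0.30,
                         distribution_cost=0.0, infra_cost=0.004)

print(f"entrant: {entrant:.4f}  platform owner: {platform:.4f}")
```

With these made-up inputs the entrant's margin is a small fraction of the platform owner's, and it goes negative as soon as the index (and so infra_cost) grows a little.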

The interesting thing for me is that in every one of the last 16 quarters, Bing has been making more money per click and Google less; that cost equation is balancing out. That is going to put a lot of pressure on the non-core parts of Google.

To answer your question, Google succeeded by capturing the value of linkage data to extract page relevance (the original PageRank patent), but in doing so they created an advertising incentive which made their algorithm break (you want a billion in-links to your page, no problem! say the black hat SEO folks). Google is still making tons of money on search, but you can look at their performance over the last 4 years to see the air is coming out of the balloon. What comes next is still an open question.

I participated in a blind test between Google, Bing and Yahoo in my Information Retrieval class at a university, back in 2013. The results were: 1) Google, 2) Bing, 3) Yahoo - for every standard IR metric we thought of, which included NDCG@{1, 5, 10}, MRR, MAP.
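For readers who don't know these metrics, here is a compact Python sketch of the usual textbook definitions. It is not the code we used in class. It makes two simplifying assumptions: the ideal ranking for NDCG is built from the judged results in the returned list only, and average precision is normalized by the number of relevant results that were actually retrieved. The function names are my own.

```python
import math

def dcg_at_k(gains, k):
    """Discounted cumulative gain over the top k graded relevance scores."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))

def ndcg_at_k(gains, k):
    """DCG@k divided by the DCG@k of the same gains sorted best-first."""
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0

def reciprocal_rank(rels):
    """1 / rank of the first relevant result, or 0.0 if there is none."""
    for rank, rel in enumerate(rels, start=1):
        if rel:
            return 1.0 / rank
    return 0.0

def average_precision(rels):
    """Mean of precision@rank taken at each relevant result."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

def mean_over_queries(metric, judged_lists):
    """MRR is this with reciprocal_rank; MAP is this with average_precision."""
    return sum(metric(rels) for rels in judged_lists) / len(judged_lists)

# Hypothetical binary judgments for two queries from one engine.
judged = [[0, 0, 1], [1, 0, 1]]
print("MRR:", mean_over_queries(reciprocal_rank, judged))
print("MAP:", mean_over_queries(average_precision, judged))
```

Some courses use 2^gain - 1 in the DCG numerator instead of the raw gain; with binary judgments the two variants give the same numbers.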
Did the results get published? Were the queries "external" or "user generated"? We found it very informative to compare the results of relevance testers (which were people who were shown a query and a set of results) with users (which were people who actually generated the query and evaluated the results). I had hoped to get a study done to get more data on that.