Hacker News new | ask | show | jobs
by dkyc 2532 days ago
> Yes, Google has a huge index, but most queries aren't in the long tail.

I'm not quite sure about that. 15% of Google searches per day are unique, as in, Google has never seen them before. [1]. That's quite an insane number.

[1] https://searchengineland.com/google-reaffirms-15-searches-ne...

9 comments

Sharing for anyone who didn't know there is a very good dataset you can use now. If you don't have a nvme ssd in your computer, I highly recommend getting one for fast i/o.

http://commoncrawl.org/ http://commoncrawl.org/the-data/ http://index.commoncrawl.org/

related.. Mark's blog is amazing and worth more than any data science degree imho.

https://tech.marksblogg.com/petabytes-of-website-data-spark-... https://tech.marksblogg.com

wow, thanks.

[edit] in my experience yacy works really well. You have it crawl the sites you frequently visit and their external links and it quickly accumulates to something more accurate than google.

Wow, 15% unique searches is indeed quite an interesting figure. With that said, what OP said is definitely not disproved. Just because 15% of searches are unique, that doesn't mean the most relevant result is buried in the tail end. I mean I can think of loads of my own searches that are probably unique or rare, but lead to the same popular results because of typos, improper wording etc.

Without some clear numbers on that from a major search engine, I think this might be very difficulty to infer.

Especially with voice searches. People are searching entire sentences rather than specific keywords which are much more likely to be unique.
Do people do this?

Or do you mean queries forwarded by home assistants trying to parse inputs?

> Do people do this?

The calling card of the developer realising that real users never act like you expect :)

Real users will use your product in ways you never imagined.
Heh, yes, they do. Which is a reminder that devs are not "typical" users.

As a developer, I search using keywords; for example, if I was looking for property for sale in Inverness, I might search for "property Inverness", whereas I've seen and heard "typical" users use something like "find me a 2 bedroom house with a garden for sale in the North of Inverness" - much more verbose, and containing stop words and phrases unlikely to help (I think!).

I do the same as you, but was just thinking that if most users search using full sentences then Google will spend most effort optimizing for that, so maybe we're the ones getting the worse results?
No, the optimization they do for the low-quality query is more than balanced out by the higher clarity and relevance of a well-phrased query. There are often extraneous words that aren't simple stop words, and they're not 100% successful at removing these extraneous ones.
I almost always search keywords while my girlfriend uses sentences and we often get quite different results. If I'm having trouble finding a good result there's a pretty good chance she will find something quickly. Surprisingly this holds true even for programming questions on topics that I know well and she's never heard of before.
> As a developer, I search using keywords;

So did I.

Around the time I left Google behind I had started to search like my wife did, using full sentences. it sometimes worked better I think.

With voice I use sentences: it's far more reliable because of the Markov model (or whatever predictive model they are using).
What does it matter whether it came from an assistant or not?

Natural language is likely the preferred search input method for kids under a certain age, who cannot yet type fluently. My kids formulate very long, complex queries verbally. The other day my son asked Alexa why the machine gun is such a deadly weapon. She replied with a snippet from Wikipedia that was surprisingly relevant.

I often do full sentences and then start deleting words from it if it doesn't work.
can confirm. i search full sentences even from the keyboard
I search full sentences (questions) from the keyboard. I figure I'm not the only to have had the question before, so I ask. Also, I find that blog posts, etc. tend to match well for full sentences.
Yes, sorry - that's me. Copy and pasting Sharepoint error messages
Those searches are unlikely to be unique.
Hmmm - Error: System.InvalidOperationException: The workflow with id=15f08b34-33f5-4063-8dea-d4ca6212c0d6 is no longer available.

is not atypical.

Does that actually work? I must be old school, I always delete such IDs before searching, but then again I used Google back when it actually did what you told it instead of misinterpreting everything for you.
It doesn't seem to have any particular effect on the results that come up. I always used to delete them, and still do sometimes but Google seems to pretty much ignore them in practice.
Which is a wonderful behavior except for all the times that the error numbers are not actually GUIDs but rather identify general errors.
If only :(
Could this be explained by supposing that people are just searching for current events, sometimes national, sometimes international, sometimes very local? If so, you really wouldn't need much indexed to handle those queries. I imagine many queries are also just overly verbose and sentence-length, which artificially inflates the number of unique queries which are actually seeking roughly the same pages.
Good point and 15% is indeed much, but the question would be what "unique" means. If it means that the exact same character sequence appeared for the first time, it doesn't mean that the users searches for a term that has never been searched for.

I mean with the newest advantages like machine learning it's more and more possible to _semantically_ link queries. If that's the case, those 15% could become 5% truly unique searches or even less.

"how dumb is trump" and "how dumb is donald trump" are two different searches but they semantically belong together because they mean the same.

How many of those are confirmed to be of human origin?
Probably quite a few. New things happen. Politics, wars, famous folks, movies, music, diseases, scientific studies, products, brands, model numbers for products, fads and slang. I'm guessing there are other things as well.

Some of the new things are probably variation as well - as others have mentioned, sentences and voice commands can give lots of new stuff.

Now I feel bad for putting gibberish like jsjsjdkktkwoapaoalf in my address bar and searching Google to test if my internet is working..
I just type "test", hopefully they do that too and it is ignored.
I do that all the time, I wonder how common that is?
I would think it’s pretty common. For a lot of people google is the internet. Or at least the reference. If google isn't working it’s almost certain it’s your end. I don’t think anyone else has that reputation for availability amongst the general public.
I think they mean that the results are still from the top pages of the internet. They mean long tail of visited pages, not long tail of searches.

A unique search query could still land you on Wikipedia.

> 15% of Google searches per day are unique, as in, Google has never seen them before.

That is impossible, and therefore wrong (I'm wrong, please see below). To know if a search is unique, as in Google has never seen them before, Google must be able to decide if a query it receives was seen before or not. Even if we assume Google needed only one bit for each message it has ever seen, and assuming it only saw 15% of new messages each day since its creation more than 20 years ago, it would need to store more than 2^1471 bits.

What could be true is that each day 15% of all searches are unique on that day.

Edit: I'm wrong. The 15% of completely unique messages per day are in regards to the messages per day, and not in regards to all messages it has ever seen, therefore exponential growth doesn't apply. To see that, assume Google just received one search query each day for 20 years but it was unique random gibberish, then Google could easily save that even though 100% of all messages per day are unique.

This is somewhat a faulty analysis. One could easily use a high accuracy bloom filter to store whether a search has definitely not been seen before, and that would be an estimate on the lower bound of the error margin.
Yup. This was actually an interview question I got from a former Google search engineer.
Where are you getting these numbers? Google says they get ~2 trillion searches per year. 40 trillion searches over 20 years (way too many) would be 2^44 searches. https://searchengineland.com/google-now-handles-2-999-trilli...

(And they don’t even need to store all searches for all time for this, thanks to Bloom filters.)

The whole point was that 2^1471 is wrong.
It is roughly 1.15^(365*20). That it is wrong was clear from its size. I wanted to use it's falseness to show that the assumptions are incorrect. Which they are, just not how I understood initially.
How are you computing that number? It's definitely wrong.

Assume Google receives 1 trillion queries per year, and has been around for 20 years. Using a bloom filter you can achieve a 1% error rate with ~10 bits per item. So a 200 terabyte bloom filter would be more than sufficient to estimate the number of unique queries.

A Bloom filter is just way overkill.

If you have a list of 20 trillion query strings, and each query string is on average < 100 bytes, you're looking at a three line MapReduce and < 1 PiB of disk to create a table which has the frequency of every query ever issued. Add a counter to your final reduce to count how often the # times seen is 1.

uh, is this sarcasm?

A bloom filter is the most appropriate data structure for this use-case. How is it overkill when it uses less space and is faster to query?

Actually the bloom filter was just an approachable example. There are much more clever and space efficient solutions to this problem, such as HyperLogLog [1] (speculating purely based on the numbers in that article, it looks like a few megabytes of space would be far more than sufficient). See the Wikipedia page on the "Count-distinct problem" [2].

1: https://en.wikipedia.org/wiki/HyperLogLog 2: https://en.wikipedia.org/wiki/Count-distinct_problem

My initial approach was also technically wrong; it tells you the fraction of queries which happen once.

To find the fraction of queries each day which are new, you would want to add a second field to your aggregation (or just change the count), the first date the query was seen. After you get the first date each query was seen, sum up the total number of queries first seen on each date, compare it to the traffic for each date.

You could still hand the problem to a new hire (with the appropriate logs access), expect them to code up the MapReduce before lunch (or after if they need to read all the documentation), farm out the job to a few thousand workers, and expect to have the answer when you come back from lunch.

I don't think it's necessarily impossible to calculate. Using probabilistic data structures arranged in a clever way, it's likely possible to calculate with some degree of accuracy.

I haven't thought this through, but take all the queries as they're made and create a bloom filter for every hour of searches. Depending when this process was started, an analytics group could then take a day of unique searches, and run them against this probabilistic history, and get a reasonable estimation with low error. Although the people who work on this sort of thing probably know it far better than I.

The real question though might be assuming the 15% is right, do we care about those 15%, are they typo's that don't merge, are they semantically different, are they bots search for dates or hashes, etc.

I believe that they're unique in a sense that nobody has typed in that exact query previously.

Of course, Google knows better but to treat every search query literally. Slight deviations and synonyms work for the majority of the people, even if us techies highly oppose them and look for alternative solutions (like DDG) that still treat our searches quite literally.