by pilgrimfff 1354 days ago
I was expecting the number of search results here to be much higher - like who cares if Google only serves the first million results out of a billion?

Very interesting to see that Google will only serve a few hundred links when they claim to have hundreds of thousands of relevant results indexed.

I'm very curious where Google is getting that count and why the reality is so different. Systematic overcounting? Suppressing hundreds of thousands of results?


The problem is generally called "deep pagination". It's extremely inefficient to compute.

Specifically, counting requires very little memory. When data is spread across 10,000 computers, having all of them count returns just 10,000 numbers, i.e. 4 bytes * 10,000 = 40 KB. It's easy for 1 computer to add up those 10,000 numbers. Even at 100,000 computers that's only 400 KB.
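
To make that arithmetic concrete, here is a rough sketch in Python. The function names and the 4-byte count size are just assumptions taken from the numbers above, not anything a real search engine exposes.

```python
def total_matches(shard_counts):
    """Coordinator step: add up one integer per shard."""
    return sum(shard_counts)


def count_payload_bytes(num_shards, bytes_per_count=4):
    """Bytes the coordinator receives if every shard sends one fixed-size count."""
    return num_shards * bytes_per_count


# Hypothetical per-shard counts for a tiny 3-shard cluster.
print(total_matches([120, 45, 300]))

# Payload sizes for the cluster sizes mentioned above.
print(count_payload_bytes(10_000))   # 40,000 bytes = 40 KB
print(count_payload_bytes(100_000))  # 400,000 bytes = 400 KB
```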

Merging sorted search results, by contrast, is extremely memory intensive. Say each result is just an Id+Score pair of 8 bytes. To get the 10,000th search result, each computer needs to create a list of its top 10,000 results, so that's 10,000 * 10,000 * 8 bytes = 800 MB. For the 100,000th search result it's 10,000 * 100,000 * 8 bytes = 8 GB. Or, if your data grows to 100,000 computers, that's 100,000 * 100,000 * 8 bytes = 80 GB of intermediate results to process at the end.
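
A minimal sketch of that scatter-gather merge, assuming each shard hands back a list of (score, doc_id) tuples already sorted by descending score. The names are made up for illustration; the point is that every shard has to send its own top `rank` hits, because all of the global top hits could live on a single shard.

```python
import heapq
from itertools import islice


def intermediate_bytes(num_shards, rank, bytes_per_hit=8):
    """Size of the candidate lists the coordinator must hold to find one deep hit."""
    return num_shards * rank * bytes_per_hit


def global_hit_at_rank(shard_results, rank):
    """Return the hit at 1-based global `rank`, or None if there are too few hits.

    shard_results: one list per shard of (score, doc_id), sorted by descending score.
    """
    # Each shard contributes up to `rank` candidates.
    candidates = [hits[:rank] for hits in shard_results]
    # heapq.merge expects ascending order, so merge on the negated score.
    merged = heapq.merge(*candidates, key=lambda hit: -hit[0])
    return next(islice(merged, rank - 1, None), None)


shards = [
    [(0.9, "a"), (0.5, "b")],
    [(0.8, "c"), (0.1, "d")],
    [(0.7, "e")],
]
print(global_hit_at_rank(shards, 3))
print(intermediate_bytes(10_000, 10_000))    # 800,000,000 bytes = 800 MB
print(intermediate_bytes(100_000, 100_000))  # 80,000,000,000 bytes = 80 GB
```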

As you can see, this doesn't scale well. Instead, you're required to retain the context (i.e. sessions) of the search in memory and get the search engine to coordinate better across all 100,000 computers. This also has scaling limitations based on the memory per session, the number of computers, the number of sessions, and their TTL (someone can leave the search page open for a day and hit "next page" - should the session still be open? That's a question each search engine has to answer).

The reality is, if a customer wants deep pagination, they are better suited to a full data dump (i.e. full table scan) or using an async search API, rather than a sync search API.

Well, at that point, who really cares if the content of the 1001st page is deterministic, or in perfect order? Get the first 100 or so pages right, and thereafter just request the nth results from each of those m computers. No merge and no memory explosion, you'll just get them slightly out of order.
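
One way to read that idea, as a rough sketch: for page p, every shard returns only its own p-th slice, and the slices are sorted among themselves with no global merge. The `per_shard` parameter and the (score, doc_id) layout are assumptions made for illustration.

```python
def approximate_page(shard_results, page, per_shard):
    """Build page `page` (0-based) from each shard's local slice only.

    shard_results: one list per shard of (score, doc_id), sorted by descending score.
    Each shard sends just `per_shard` hits, however deep the page is.
    """
    start = page * per_shard
    hits = []
    for shard_hits in shard_results:
        hits.extend(shard_hits[start:start + per_shard])
    # Order is only correct within this page, not across pages.
    hits.sort(key=lambda hit: hit[0], reverse=True)
    return hits


skewed = [
    [(0.9, "a"), (0.8, "b")],  # this shard holds both of the best hits
    [(0.3, "c"), (0.2, "d")],
]
print(approximate_page(skewed, 0, 1))
print(approximate_page(skewed, 1, 1))
```

In the skewed example, "b" is the second-best hit overall but only shows up on the second page, which is the "slightly out of order" trade-off.
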
You still need to filter based on the other indexes. If you search for [bitcoin mining] you don't want to find pages related to coal mining. So this data still needs to be joined.
The search term for this is "intersection". The posting lists for the two terms are intersected, then the results are ranked. But there are a lot more steps in a production search engine.
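
For anyone unfamiliar, a minimal sketch of intersecting two posting lists, assuming each list holds document ids in ascending order. The tiny index below is made up, and a production engine does far more than this.

```python
def intersect(postings_a, postings_b):
    """Two-pointer intersection of two ascending lists of document ids."""
    i = j = 0
    common = []
    while i < len(postings_a) and j < len(postings_b):
        if postings_a[i] == postings_b[j]:
            common.append(postings_a[i])
            i += 1
            j += 1
        elif postings_a[i] < postings_b[j]:
            i += 1
        else:
            j += 1
    return common


# Hypothetical inverted index: term -> sorted document ids.
index = {
    "bitcoin": [2, 4, 9, 12],
    "mining": [1, 4, 7, 9],  # includes coal-mining pages 1 and 7
}
print(intersect(index["bitcoin"], index["mining"]))
```

Only the documents containing both terms survive, and those are what would then be ranked.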

The long and short of it is, if you really want the full results, just join Google, join the search team, and then get enough experience that you can do full queries over the docjoins directly. This was part of Norvig's pitch to attract researchers a while ago. For a research project, I built a regular expression that matched DNA sequences and spat out the list of all pages containing what looked like DNA, and then annotated the pages so that in principle you could have done dna:<whatever sequence>, but obviously that was not a goal for the search team.

I used to work at Google but not in search, these are just my own guesses.

> where Google is getting that count

This is very likely a fairly accurate count of the number of pages in Google's index that "match" the search query. Basically exactly what you would expect when you see the number.

> why the reality is so different

Cost reasons. Most search engines are more or less scanning down a sorted list of pages. The further you need to scan the more expensive it is. Just like running "OFFSET 1000" is usually slow in SQL. At some point the quality of results is generally very low and the cost is growing so it makes sense overall for Google to just cut it off to prevent it becoming an abuse vector (imagine just asking Google for the 10 millionth page of results for "cat").
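
A toy illustration of why deep offsets cost more, assuming results are produced by walking a sorted sequence from the top. This is only a sketch of the general idea, not how Google or any particular SQL engine is implemented, and it assumes `limit` is positive.

```python
def fetch_page(sorted_hits, offset, limit):
    """Return (page, scanned): the requested page and how many hits were touched."""
    scanned = 0
    page = []
    for hit in sorted_hits:
        scanned += 1
        if scanned > offset:
            page.append(hit)
        if len(page) == limit:
            break
    return page, scanned


first_page, first_cost = fetch_page(range(10_000), 0, 10)
deep_page, deep_cost = fetch_page(range(10_000), 1000, 10)
print(first_cost, deep_cost)
```

Both calls return 10 hits, but the deep one has to walk past the first 1000 before it can return anything.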

The fact that few people realize that Google has a page limit shows how rarely people actually want more pages.

> The fact that few people realize that Google has a page limit shows how rarely people actually want more pages.

I used to (years and years ago) go past the first page pretty often, but results are so bad now that it rarely helps, so I almost never even click "2", let alone later pages. It's all gonna be obviously-irrelevant crap Google "helpfully" found for me or the auto-generated spam that Google used to try to fight (circa 2008 and earlier) but no longer seems to, just letting it gunk up and dominate any results you get that aren't from a handful of top sites.

So this is in part one of those "we broke a thing and now no-one uses it, guess they didn't want it!"

What's really weird is that sometimes you get results that are outright repeating on those first N pages. Sometimes, more than once.

It's almost as if it tries to pad the output to be long enough that you'd lose patience before you reach the end of "effective pagination".

The thing has always been "broken". Google has had a page limit for at least a decade.
No, by "broken" I mean "let lazy auto-generated spam take over the results almost completely". So now those of us who did used to browse past page one (which, to be fair, may not have been many people) don't bother anymore.

[EDIT] For those who weren't around for it, Google used to play cat-n-mouse with spam-site operators. It'd go through cycles where results would slowly get worse, then suddenly a ton better, though never as bad as they are today. Around '08 or '09 they (evidently, I'm just judging from the search engine's behavior starting around then and continuing to this day) seemed to give up and just boosted a relatively small set of sites way up the results, abandoning the rest to the spammers.

Part of the difficulty is, if very few people are browsing to page 2, deciding what to put on page 2 becomes harder and harder.

Google has a lot of user behavior signals to decide what should be in results 1-10. Deciding if a page should be ranked 20, 200, or 2000 without any user clicks to check if you're right is really difficult.

I would bet that since 2008/9, the relative numbers of spam site operators, Google engineers, and second-page searches have changed significantly.

Kagi has been working very well for me as an alternative
I find search results are frequently even worse than this, in that the first page will have nothing useful, with about three good links split between the second and third page. If I'm lucky.
If you've ever read Larry Niven's Fleet of Worlds series, there's a Bussard Ramjet with an AI programmed to hide any information that could help a hostile enemy/force find their way back to Earth.

A small cadre of humans who were raised by an Alien Race who came across a human seed ship cross paths with this Ramjet, and one of the protagonists realizes something is off when they do a query on the size of presentable search results in the astrographic/navigational dataset, and realizes that the number of starmaps the AI will produce is far smaller than the amount of space the system actually dedicates to storing said maps.

Point being, you can't trust any system that restricts results to a subset not to have been designed to leave out results. Furthermore, it makes a great, plausibly deniable way to drop search results... Force ranking to 10001+.

You'll forgive me, I hope, if I suspect a company well known for cooperating with an anti-humanitarian regime (Project Dragonfly), and that regularly black-holes other undesirable datapoints, of engaging in less than up-front search result presentation?

This isn't the revelation you act like it is. Because of course Google hides results. They don't pretend not to, and they even inform webmasters when it happens. The Search Console calls it a "Manual action" when they do so.

More importantly, the people asking for a "censorship-free search engine" are expressing an incoherent desire. The whole point of a search engine is to take the zillions of web pages that have matching keywords, push the crap to the bottom, and leave the gold on top. A system that does this is inherently censorious. We're just quibbling over what the criteria should be.

What our world lacks is a reasonably quick way to hold Google accountable when they fail to represent the interests of the public who searches with them. The real-world consequences of their filtering decisions need to filter back to the people making these decisions. Because "just don't make any filtering decisions" isn't going to result in a usable information retrieval system.

> "censorship-free search engine" are expressing an incoherent desire

That's not really true. `grep` is a censorship-free search engine. It just reports every matching result.

Of course that wouldn't generally be useful over the web, however even with sorting it is possible to be censorship free. You just need to include every matching result eventually.

Of course you would find that generating later pages also becomes expensive, so you may add a page limit and ask the user to refine the query instead. Then you are back to the problem that it can be very difficult to find every result, because you need to guess what words are on the page.

But all of this is basically moot because Google doesn't claim to be censorship-free, so they have a much simpler way of hiding results.

> even with sorting it is possible to be censorship free. You just need to include every matching result eventually

Do you honestly think that the people who complain about their favorite website being censored by Google would be satisfied with showing up on page 200*? I wouldn't.

It's only "not censorship" in the same sense that having your emails sent to the Spam folder isn't censorship. The spam folder, and low-scoring SERP results, are so full of items that every reasonable person acknowledges to be crap that getting banished to that area is pretty much equivalent to having someone blast your roadside protest with strobe lights and a sonic cannon. Surrounding you with so much garbage data that nobody can see or hear you any more is only "not censorship" on the dumbest technicality.

* Ignore, for sake of argument, the fact that page 200 won't even load in our universe. I'm imagining a parallel world where Google pretends to be censorship-free because they only push things far down in the results instead of removing them entirely.

"Do you honestly think that the people who complain about their favorite website being censored by Google would be satisfied with showing up on page 200*?"

My complaint has nothing to do with my favorite website. My complaint has to do with not being able to discover information and websites because Google won't allow me to dig very far into their search results. They're spidering the vast majority of the internet, and all I get are crumbs.

They're doing more than "push the crap to the bottom". They're pushing the crap to the bottom and then limiting how far you can dig into the pile. I am sometimes interested in that crap.
I agree. If you really want to see every result for a topic this system hurts you. However I think that use case is vanishingly rare. Most users would be better served by refining their query for what they are interested in than paging through hundreds of pages of results.

Google isn't designed to be an archive of every webpage matching a search result, it isn't what their infrastructure is optimized for.

"Google isn't designed to be an archive of every webpage matching a search result, it isn't what their infrastructure is optimized for."

I believe that's exactly what Google is. Limiting search results probably has to do with being able to serve more queries and respond quicker.

>The fact that few people realize that Google has a page limit shows how rarely people actually want more pages.

The fact is I just want the long tail or weird results so I can escape content farms, but I guess if it were possible for Google to serve those, content farms would spring up to game the long tail or weird results market.

Google tries to ignore them already, so the long tail is probably littered with old and mitigated content farms because they "match" but have a low page rank.