by pilgrimfff
1354 days ago
I was expecting the number of search results here to be much higher - like, who cares if Google only serves the first million results out of a billion? Very interesting to see that Google will only serve a few hundred links when they claim to have hundreds of thousands of relevant results indexed. I'm very curious where Google is getting that count and why the reality is so different. Systematic overcounting? Suppressing hundreds of thousands of results?
|
Specifically, counting requires very little memory. When data is spread across 10,000 computers, having all of them count returns just 10,000 numbers, i.e. 4 bytes * 10,000 = 40 KB. It's easy for 1 computer to add up those 10,000 numbers. Even at 100,000 computers it's only 400 KB.
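A rough sketch of why the count is cheap, assuming a simple scatter-gather setup where each computer (called a "shard" here; the function names are made up for illustration) reports a single 4-byte count:

```python
def coordinator_count(shard_counts):
    """Add up per-shard hit counts; the coordinator holds one integer per shard."""
    return sum(shard_counts)


def gather_bytes(num_shards, bytes_per_count=4):
    """Bytes the coordinator receives when every shard returns a single count."""
    return num_shards * bytes_per_count


# Hypothetical per-shard counts for a 10,000-shard cluster.
shard_counts = [37] * 10_000
total_hits = coordinator_count(shard_counts)
payload = gather_bytes(len(shard_counts))  # 40,000 bytes, i.e. the 40 KB above
```

The payload grows only with the number of shards, not with how many documents matched.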
Merging sorted search results, by contrast, is extremely memory intensive. Take just an Id+Score pair, say 8 bytes. To get the 10,000th search result, each computer needs to build a list of its own top 10,000 results; across 10,000 computers that's 10,000 * 10,000 * 8 bytes = 800 MB. For the 100,000th search result it's 10,000 * 100,000 * 8 bytes = 8 GB. Or, if your data grows to 100,000 computers, that's 100,000 * 100,000 * 8 bytes = 80 GB of intermediate results to process at the end.
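To make the arithmetic concrete, here is a minimal sketch of a naive scatter-gather merge. It assumes each shard sends back its own best `offset + size` hits as (score, doc_id) tuples, since the coordinator can't know in advance which shard holds the hits for a given page. The function names and the 8-byte figure are illustrative, not any particular engine's API:

```python
import heapq
from itertools import islice

BYTES_PER_HIT = 8  # assumed size of one (id, score) pair


def intermediate_bytes(num_shards, depth, bytes_per_hit=BYTES_PER_HIT):
    """Size of the candidate lists the coordinator must merge to reach rank `depth`."""
    return num_shards * depth * bytes_per_hit


def shard_top_k(hits, k):
    """Each shard returns its k best (score, doc_id) pairs, best first."""
    return heapq.nlargest(k, hits)


def merged_page(shard_lists, offset, size):
    """Merge per-shard lists (each sorted best first) and slice out one page."""
    merged = heapq.merge(*shard_lists, reverse=True)
    return list(islice(merged, offset, offset + size))


# Toy example: 3 shards, page of 2 results starting at rank 3 (offset 2).
shards = [
    [(0.9, 1), (0.4, 2), (0.1, 3)],
    [(0.8, 4), (0.7, 5)],
    [(0.6, 6)],
]
offset, size = 2, 2
lists = [shard_top_k(hits, offset + size) for hits in shards]
page = merged_page(lists, offset, size)
```

Every shard has to ship `offset + size` candidates even though only `size` of them end up on the page, which is where the shards * depth * 8 bytes blow-up comes from.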
As you can see, this doesn't scale well. Instead, you have to retain the context of the search (i.e. sessions) in memory and get the search engine to coordinate better across all 100,000 computers. This also has scaling limits, based on the memory per session, the number of computers, the number of sessions, and their TTL (someone can leave the search page open for a day and then hit "next page" - should the session still be open? That's a question each search engine has to decide).
The reality is, if a customer wants deep pagination, they are better served by a full data dump (i.e. full table scan) or an async search API than by a sync search API.