Hacker News new | ask | show | jobs
by yazaddaruvala 1355 days ago
The problem is generally called "deep pagination". It's extremely inefficient to compute.

Specifically, counting requires very low memory. When data is spread across 10,000 computers, all of them counting returns just 10,000 numbers i.e. 4 bytes * 10,000 = 40KB. It's easy for 1 computer to count those 10,000. Even at 100,000 computers 400KB.

Merging sorted search results is extremely memory intensive. Even with just the Id+Score pair, let's say 8 bytes. To get the 10,000th search result, each computer needs to create a List of 10,000 results, thats 10,000 * 10,000 * 8 bytes = 800 MB. For the 100,000th search result 10,000 * 100,000 * 8 bytes = 8 GB. OR if your data grows to 100,000 computers, thats 100,000 * 100,000 * 8 bytes = 80 GB of intermediate results to process at the end.

As you can see this doesn't scale well. You're required to retain context (i.e. sessions) of the search in memory instead, and get the search engine to better coordinate across all 100,000 computers. This also has scaling limitations based on memory of the session, the number of computers, the number of sessions, and their TTL (someone can leave the search page open for day and hit "next page" - should the sessions still be open? Thats an answer each search engine has to decide).

The reality is, if a customer wants deep pagination, they are better suited to a full data dump (i.e. full table scan) or using an async search API, rather than a sync search API.

1 comments

Well at that point, who really cares if the content of the 1001s page is deterministic, or in perfect order? Get the first 100 or so pages right, and thereafter just request the nth results from each of those m computers. No merge and no memory explosion, you'll just get them slightly out of order.
You still need to filter based on the other indexes. If you search for [bitcoin mining] you don't want to find pages related to coal mining. So this data still needs to be joined.
the search term for this is intersection. The posting lists for the two terms are intersected, then the results are ranked. But there are a lot more steps in a production search engine.

The long and short of it is if you really want the full results, just join google, join the search team, and then get enough experience so that you can do full queries over the docjoins directly. This was part of Norvig's pitch to attract researchers a while ago. For a research project, I built a regular expression that matched DNA sequences and spat out the list of all pages containing what looked like DNA and then annotated the pages so in principle you could have done dna:<whatever sequence> but obviously that was not a goal for the search team.