| HN Mirror

To be clear, operations in our test corpus that took X ms on average took 100X ms. Is that the statement you are looking for?

One thing that I think you're missing is that while we experienced a 100x gain for our application, our findings aren't strictly empirical in the sense that the gain is always 100x. In many applications it will be more. The insight in the blog post is analytical, not empirical. Specifically, most people (I think) would assume a text index to be document partitioned. In Riak it's not. On a fundamental level this means that AND operations take time O(max(|A|,|B|)) instead of O(min(|A|,|B|) where |A| and |B| are the sizes of results that match query A and B, respectively. If you pick words at random from a power law distribution (which is typical in all natural languages) you will more often than not see many orders of magnitude difference in the |A| and |B|. If you sample queries from real query streams, you will see the same. For some intuition, just think about a query which has a common tag (like "funny" used in the example) and something restricted (like a particular user). The first will typically be an understood fraction of the size of your corpus while the second will have size that is more like a constant.

Putting this all together, the savings that we find for moving the intersection where we did will only grow on a relative basis as we get more users and clips. This hack costs 2x the storage size but always give O(min(|A|,|B|) complexity results instead of O(max(|A|,|B|)). 100x win is just shorthand for saying that those two set sizes typically varied by two orders of magnitude because of power law distributions. But in academia, we refer to that as "really fucking big".

The "presort" option that we added will scale differently, but have equally dramatic impact because pagination is effectively broken in Riak search if you do not use relevance sort. In this case, the O(.) argument is murkier because one version is done entirely within the index (which is usually in RAM) and the other has to hit the disk. In formal engineering circles, we refer to this type of boost as "unfucking real".

So, we could have measured things with more precision, but this was such an obvious win that it was almost pointless to calibrate it further.