| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by gwf 5205 days ago

To be clear, operations in our test corpus that took X ms on average took 100X ms. Is that the statement you are looking for?

One thing that I think you're missing is that while we experienced a 100x gain for our application, our findings aren't strictly empirical in the sense that the gain is always 100x. In many applications it will be more. The insight in the blog post is analytical, not empirical. Specifically, most people (I think) would assume a text index to be document partitioned. In Riak it's not. On a fundamental level this means that AND operations take time O(max(|A|,|B|)) instead of O(min(|A|,|B|) where |A| and |B| are the sizes of results that match query A and B, respectively. If you pick words at random from a power law distribution (which is typical in all natural languages) you will more often than not see many orders of magnitude difference in the |A| and |B|. If you sample queries from real query streams, you will see the same. For some intuition, just think about a query which has a common tag (like "funny" used in the example) and something restricted (like a particular user). The first will typically be an understood fraction of the size of your corpus while the second will have size that is more like a constant.

Putting this all together, the savings that we find for moving the intersection where we did will only grow on a relative basis as we get more users and clips. This hack costs 2x the storage size but always give O(min(|A|,|B|) complexity results instead of O(max(|A|,|B|)). 100x win is just shorthand for saying that those two set sizes typically varied by two orders of magnitude because of power law distributions. But in academia, we refer to that as "really fucking big".

The "presort" option that we added will scale differently, but have equally dramatic impact because pagination is effectively broken in Riak search if you do not use relevance sort. In this case, the O(.) argument is murkier because one version is done entirely within the index (which is usually in RAM) and the other has to hit the disk. In formal engineering circles, we refer to this type of boost as "unfucking real".

So, we could have measured things with more precision, but this was such an obvious win that it was almost pointless to calibrate it further.

1 comments

tpsreport 5204 days ago

You have a knack for producing a lot of text with no supporting evidence. Especially of a quantitative nature.

This might go over well with a certain class of people, but it offends my engineering sensibilities.

link

Shalen 5204 days ago

Hi TPSReport - You realize that "gwf" is the Gary William Flake who has written an award winning complex book that is used in colleges worldwide as well as has run R&D for many companies like Microsoft, Yahoo, and Overture right? I'm sorry to say this but personally I'm finding your comments on this post as well other posts negative and lacking value. It would be great and helpful if you have any constructive/positive ideas, tips or experience to share to add value to the community. I'm sharing this in a positive constructive spirit. Thanks.

link

tpsreport 5204 days ago

Let me reiterate my constructive suggestion that the OP provide actual data. I can see how the author wrote a pop book, as he is quite verbose, but any engineer can tell you that he is also quite evasive on technical questions. "What's the baseline?" should not pose a tough question for anyone writing about performance improvements.

link