Hacker News new | ask | show | jobs
by siscia 1976 days ago
The index does not store the data at all, it store slopes.

You start by sorting the data, make a piecewise linear interpolation, and you store each slope as triplet (key, slope, intercept) with key being the smallest value in the piecewise interpolation.

I find it quite clever to be honest.

I am not sure how it works on inserts and delete, didn't read the whole paper.

More digestible info on the slides.

1 comments

I'm almost good at math. Could you enlighten me, when you say "slope", are you talking about the slope of a line as defined by "The Equation of a Straight line" [0], i.e. the "m" in "y=mx+b"?

And what do you mean by "intercept". Are you referring to the "b" in that same equation?

If that's true, in what way is that an optimization to storing "x" and "y" at each node?

[0] https://www.mathsisfun.com/equation_of_line.html

In a traditional tree you just store a bunch of y's. So let's say 7 points are just y0-y6. In a tree you would need at least 7 keys stored. In this you would just have m and b stored and you would use the formula to look up the index.
Thanks, that really helps get an intuition for how this works! It's pretty clever really.
As parent said, it is a piecewise linear interpolation, which means it is divided into line segments. Hence you don't need the parameters for the whole index, only for each segment, as the parameters are the same for each point in the same segment.
"piecewise linear interpolation"

After having linearly interpolated the points on the curve piecewise, instead of a smooth curve, you now have line segments that connect to other line segments, each point now connected to the next, not smoothly, but linearly, and each of those lines/segments can be optimized by storing only slope and intercept. Yes?

That seems both ingenious and very, very straight-forward. What is the reason behind us not having deployed massive amounts of search indices into the world that were built using this method? Has the drawback of this method been too great?

The drawback to me seems to be the amount of computation needed to produce such an index.

Hi @kreeben, the construction is very fast, it can be done in linear time with a single scan of the data. Just to give you some figures, we constructed a PGM-index on 10^12 key-value pairs (8 bytes + 8 bytes) loaded in main memory in less than 3 seconds.

About the information that we store, we prefer to say "piecewise linear approximation" because the line segments do not necessarily connect points, and they are not connected to each other. The guarantee of the piecewise linear approx is that each line segment is far from the input points by at most a user-given integer ε. With this guarantee, when we compute the approximate position p of a given query key q, we can always find the true position of q after a fast binary search in the range of positions [p-ε,p+ε].

The piecewise linear model is nothing more than a list of triples (key,slope,intercept). The key is needed because we need to know where a segment starts and where the previous one ends. The slope and the intercept are exactly as you said, the parameters that let us compute the approximate position via p = slope * q + intercept.

It is, arguably, deployed at scale, if you squint and call it delta compression. The reason its not more widely used is there are drawbacks. Updates can be more expensive, for instance: what was a single-key change perhaps blew up a 1000 key "line" and now all those keys need reshuffling. You also see more muted returns because generally keys in indexes point to something, so compressing the keys creates complexity in how you map each key to its value or reference. Compared to a solution that stores keys close to their values, a solution that packs keys together gets poor cache locality for resolving the values, and may need mapping tables.

Basically, delta compression is used all over the place, but it isnt a silver bullet.

> What is the reason behind us not having deployed massive amounts of search indices into the world that were built using this method?

There are a lot of other strategies for segmented data indexes. Many of which are roughly equivalent to this in practice. For example, using a computed strategy for distributing data among various nodes on a server. In such a situation, a master node can run a computation on an indexed value to identify the node(s) in the cluster which contain the data and send a query request to only the nodes containing that information.

I would think the biggest drawback to this particular strategy is, often times, you need to analyze the data in the index to get an answer anyway. So it's not really beneficial to throw away the index data. For example, if the indexed field is a timestamp, and the result set is often sorted on that field, then it makes total sense to keep the index values and use that to perform sorting while a separate thread fetches the data needed to flesh out the query.