Hacker News new | ask | show | jobs
by joe_the_user 1968 days ago
I don't get it. I've implemented B-trees. The majority of space the used by a B-tree is the data itself. Each N-ary leaf of the tree is a basically a vector of data with maybe some bookkeeping at the ends. The leaves are more than half of the tree.

Sure, you can compress the data. But that depends on the data, completely random data can't be compress. Other data can be. But a point blank 83x space claim seems bizarre - or it's comparing to a very inefficient implementation of a B-tree.

Edit: It seems the 83x claim is a product of the HN submission. I could not find it on the page. But even the page should say something like "a compressed index that allows full speed look-up" (akin to succinct data structures) and then it would make sense.

7 comments

So, from a quick read, I think there are a few things at play here that allow for "compression of random data".

One, and probably the biggest one, _this isn't lossless compression_. As other commenters mentioned, this aggregates groups of points into line segments and stores their slopes (allowing for a pre-specified error of up to epsilon).

Two, while the sample input data is randomly generated, it then needs to be sorted before it can be used here. This completely changes the distributional qualities (see: order statistics sampled from a uniform distribution [0]). Just as a toy example, suppose this was a million randomly-generated binary digits. Sure, you could store the million digits in sorted order, or you could just use run-length encoding and say "I have 499,968 zeroes and 500,032 ones" [1].

[0] https://en.wikipedia.org/wiki/Order_statistic#Order_statisti...

[1] I know, this is a dense sampling on the input space. But that's the sort of intuition that allows you to compress sorted data better than you'd be able to compress the unsorted data. The provided C++ code provides a sparse sampling.

The index does not store the data at all, it store slopes.

You start by sorting the data, make a piecewise linear interpolation, and you store each slope as triplet (key, slope, intercept) with key being the smallest value in the piecewise interpolation.

I find it quite clever to be honest.

I am not sure how it works on inserts and delete, didn't read the whole paper.

More digestible info on the slides.

I'm almost good at math. Could you enlighten me, when you say "slope", are you talking about the slope of a line as defined by "The Equation of a Straight line" [0], i.e. the "m" in "y=mx+b"?

And what do you mean by "intercept". Are you referring to the "b" in that same equation?

If that's true, in what way is that an optimization to storing "x" and "y" at each node?

[0] https://www.mathsisfun.com/equation_of_line.html

In a traditional tree you just store a bunch of y's. So let's say 7 points are just y0-y6. In a tree you would need at least 7 keys stored. In this you would just have m and b stored and you would use the formula to look up the index.
Thanks, that really helps get an intuition for how this works! It's pretty clever really.
As parent said, it is a piecewise linear interpolation, which means it is divided into line segments. Hence you don't need the parameters for the whole index, only for each segment, as the parameters are the same for each point in the same segment.
"piecewise linear interpolation"

After having linearly interpolated the points on the curve piecewise, instead of a smooth curve, you now have line segments that connect to other line segments, each point now connected to the next, not smoothly, but linearly, and each of those lines/segments can be optimized by storing only slope and intercept. Yes?

That seems both ingenious and very, very straight-forward. What is the reason behind us not having deployed massive amounts of search indices into the world that were built using this method? Has the drawback of this method been too great?

The drawback to me seems to be the amount of computation needed to produce such an index.

Hi @kreeben, the construction is very fast, it can be done in linear time with a single scan of the data. Just to give you some figures, we constructed a PGM-index on 10^12 key-value pairs (8 bytes + 8 bytes) loaded in main memory in less than 3 seconds.

About the information that we store, we prefer to say "piecewise linear approximation" because the line segments do not necessarily connect points, and they are not connected to each other. The guarantee of the piecewise linear approx is that each line segment is far from the input points by at most a user-given integer ε. With this guarantee, when we compute the approximate position p of a given query key q, we can always find the true position of q after a fast binary search in the range of positions [p-ε,p+ε].

The piecewise linear model is nothing more than a list of triples (key,slope,intercept). The key is needed because we need to know where a segment starts and where the previous one ends. The slope and the intercept are exactly as you said, the parameters that let us compute the approximate position via p = slope * q + intercept.

It is, arguably, deployed at scale, if you squint and call it delta compression. The reason its not more widely used is there are drawbacks. Updates can be more expensive, for instance: what was a single-key change perhaps blew up a 1000 key "line" and now all those keys need reshuffling. You also see more muted returns because generally keys in indexes point to something, so compressing the keys creates complexity in how you map each key to its value or reference. Compared to a solution that stores keys close to their values, a solution that packs keys together gets poor cache locality for resolving the values, and may need mapping tables.

Basically, delta compression is used all over the place, but it isnt a silver bullet.

> What is the reason behind us not having deployed massive amounts of search indices into the world that were built using this method?

There are a lot of other strategies for segmented data indexes. Many of which are roughly equivalent to this in practice. For example, using a computed strategy for distributing data among various nodes on a server. In such a situation, a master node can run a computation on an indexed value to identify the node(s) in the cluster which contain the data and send a query request to only the nodes containing that information.

I would think the biggest drawback to this particular strategy is, often times, you need to analyze the data in the index to get an answer anyway. So it's not really beneficial to throw away the index data. For example, if the indexed field is a timestamp, and the result set is often sorted on that field, then it makes total sense to keep the index values and use that to perform sorting while a separate thread fetches the data needed to flesh out the query.

If there's even a rough order to the underlying data, I'll buy their claim. On ordered data, a Postgres block-range index (BRIN) is often several orders of magnitude smaller than a B-tree index.

If the data is random, I suspect you're right and the PGM index is no-better than a B-tree index. Most data does have an order and would probably see similar gains.

Not only is there a rough order, they explicitly require the ability to meaningfully embed data into the reals, and the performance gains come from assuming those embeddings have simple delta distributions. I wouldn't be surprised if the technique is worse than useless when that assumption is violated.

Edit: I don't have time right now, but a toy example I like to throw at these kinds of problems is mapping primes to their indices (e.g. 2->0, 3->1, 5->2, ...). General-purpose learning algorithms can usually make a little headway with it, but not much, and only with substantial resources thrown at the problem. I'd be shocked if that toy example were any faster with their solution than a b-tree.

Due to the prime number theorem, your toy example actually has a very good approximated mapping.
Kind of. Appropriately normalizing the primes with some kind of function based on log(log(n)) would probably allow the article's technique to perform well (still struggling on larger primes -- you can't avoid needing a lot of bits to express those gaps), but it's precisely the shifting density which makes the problem hard to learn with any standard algorithm, and I don't think theirs would be an exception since they explicitly rely on gaps drawn from some fixed distribution.
I discovered BRIN indexes fairly recently. You trade odd a bit of speed (data & query dependant, of course), but potentially gain a huge amount of disk space vs B-Trees - the first time I created a BRIN index, I did a double take because I thought something must be wrong because the index was so small!
The size comparison here only concerns the size of index. That is the size of the data is excluded.
Parent means the size of the key data present in the index, I suspect. In a B-tree every byte of every key value is present.
They do it by not storing keys in the index. A B-tree has copies of all the keys in the tree, and (normally) also in the data. Here, they just have slopes that get you close to the right actual element.
> completely random data can't be compress

Given that a normal B-Tree can't retrieve the original order, it must be more compressible than random data and a representation that would let you represent a wrong order has invalid sequences, so it must be less space efficient than one that would use those sequences to mean something valid.

The paper says 83x, and even makes stronger claims that are fairly unqualified:

In short, the experimental achievements of the PGM-index are: (i) better space occupancy than the FITing-tree by up to 75% and than the CSS-tree by a factor 83×, with the same or better query time; (ii) uniform improvement of the performance of RMI in terms of query time and space occupancy, and 15× faster construction, while requiring no hyperparameter tuning; (iii) better query and update time than a B+-tree by up to 71% in various dynamic workloads while reducing its space occupancy by four orders of magnitude (from gigabytes to few megabytes).

I wonder why it was just four orders of magnitude, though. Why not six or twelve?

Have you examined the results presented in section 7? They take pains to qualify these claims.

If you are curious, the authors have provided completely reproducible experiments.

I didn't say that these specific claims are simply false. Just that they sure seem to hint at extraordinary improvements in a broad range of cases. The website itself also seems to me to be suggestive in just the same way, only much more so.

I myself don't think that that amounts to taking pains to qualify their claims. Other people will have to make their own minds up about whether or not that's the case.