Hacker News new | ask | show | jobs
by sa46 1974 days ago
If there's even a rough order to the underlying data, I'll buy their claim. On ordered data, a Postgres block-range index (BRIN) is often several orders of magnitude smaller than a B-tree index.

If the data is random, I suspect you're right and the PGM index is no-better than a B-tree index. Most data does have an order and would probably see similar gains.

2 comments

Not only is there a rough order, they explicitly require the ability to meaningfully embed data into the reals, and the performance gains come from assuming those embeddings have simple delta distributions. I wouldn't be surprised if the technique is worse than useless when that assumption is violated.

Edit: I don't have time right now, but a toy example I like to throw at these kinds of problems is mapping primes to their indices (e.g. 2->0, 3->1, 5->2, ...). General-purpose learning algorithms can usually make a little headway with it, but not much, and only with substantial resources thrown at the problem. I'd be shocked if that toy example were any faster with their solution than a b-tree.

Due to the prime number theorem, your toy example actually has a very good approximated mapping.
Kind of. Appropriately normalizing the primes with some kind of function based on log(log(n)) would probably allow the article's technique to perform well (still struggling on larger primes -- you can't avoid needing a lot of bits to express those gaps), but it's precisely the shifting density which makes the problem hard to learn with any standard algorithm, and I don't think theirs would be an exception since they explicitly rely on gaps drawn from some fixed distribution.
I discovered BRIN indexes fairly recently. You trade odd a bit of speed (data & query dependant, of course), but potentially gain a huge amount of disk space vs B-Trees - the first time I created a BRIN index, I did a double take because I thought something must be wrong because the index was so small!