Hacker News new | ask | show | jobs
by jltsiren 1974 days ago
Some context: A few years ago, there was a paper from Google (https://dl.acm.org/doi/10.1145/3183713.3196909) that made learned data structures popular for a while. They started from the idea that indexes such as B-trees approximate an increasing function with one-sided error. By using that perspective and allowing two-sided error, they were able to make the index very small (and consequently quite fast).

Many data structure researchers got interested in the idea and developed a number of improvements. The PGM-index is one of those. Its main idea is to use piecewise linear approximations (that can be built in a single quick pass over the data) instead of the machine learning black box the Google paper was using.

3 comments

Hi Jouni!

You may find interesting these other papers of ours:

- The ALENEX21 paper "A 'learned' approach to quicken and compress rank/select dictionaries" (http://pages.di.unipi.it/vinciguerra/publication/learned-ran..., https://github.com/gvinciguerra/la_vector), where we introduce a compressed bitvector supporting efficient rank and select queries, which is competitive with several well-established implementations of succinct data structures.

- The ICML20 paper "Why are learned indexes so effective?" (http://pages.di.unipi.it/vinciguerra/publication/learned-ind...) where we prove that, under some general assumptions on the input data, the space of the PGM-index is actually O(n/B^2) whp (versus Θ(n/B) of classic B-trees).

Only having skimmed the work, read the following as a somewhat educated guess.

I think one could see this similar to repeated applications of interpolation search. If you are looking for x in a sorted array of n numbers between a and b, then index (x - a) / (b - a) * (n - 1) would be a good guess assuming uniform distribution of the numbers.

But as one can not assume a uniform distribution in general, one does that repeatedly. The first interpolation leads to better interpolation coefficients for the relevant subrange, which may lead to even better interpolation coefficient for an even smaller subrange until one eventually finds what one was looking for.

If there is no structure in the data that can be exploited, this degenerates into a more or less ordinary tree as we one certainly fit a line through two points, but if at some level a larger range of the data can be well approximated with the interpolation function, then it can save space and search time because one can get close to all the values in the range with only one set of interpolation coefficients and only one interpolation.

Isn't it sometimes better to assume data has a structure rather than having less performance but knowing it works equally well with unstructured semi-random data?
But if you are inventing or implementing a general purpose data structure, then you can not assume any structure by definition, especially not when it comes to worst case performance. On the other hand you can certainly include special handling of common special cases that will improve the performance or even come up with special data structures that only work under specific circumstances but then outperform general purpose solutions. As always it is a trade-off, here between generality and performance.
An interesting case in which Google trolled people into action by publishing a purely hypothetical paper for which there's no evidence they intended to put it into actual practice.