Hacker News new | ask | show | jobs
by legulere 1966 days ago
In the example they sort the data array. Does that mean this works just on sorted arrays? Insert and delete performance would be horrible I guess.
5 comments

Hi @legulere!

Yep, the example of Figure 2 shows only a static PGM-index on a sorted array.

Insertion and deletions are discussed in Section 3 "Dynamic PGM-index" and experimented in Section 7.3.

The Dynamic PGM-index is open-source too: you can find the implementation at https://github.com/gvinciguerra/PGM-index/blob/master/includ... and the documentation at https://pgm.di.unipi.it/docs/cpp-reference/#classpgm_1_1_dyn...

They propose a solution for dynamic PGM indexes in the paper (section 3) and benchmark it (section 6). A summary is that, in their benchmark, their index is faster by 13%-71% in most cases, but can be slower (1%-15.2%) in a few cases.

I agree the example would be more eye-catching without that sort.

http://www.vldb.org/pvldb/vol13/p1162-ferragina.pdf

In the full paper they quote a rather interesting method [1] that allows you to insert values in amortized O(log(n)) time (deletes are apparently handled with tombstones, presumably rebuilding the whole thing when a sufficiently large proportion is deleted).

A very abridged explanation of how they handle inserts: you split the collection in a list of collections where position k contains either nothing or a collection size 2^k. When you want to add a new value you find the first empty spot and fill it by building a set of your new value together with the collections of all the preceding spots (because the sizes are all sequential powers of two this will fit exactly). Provided that merging the collections takes linear time this takes an amortized O(log(n)) per inserted item.

Of course once you have this you can use it for any learned index that can be learned in linear time.

[1]: M. H. Overmars. The Design of Dynamic Data Structures, volume 156 of Lecture Notes in Computer Science. Springer, 1983.

Skimming the paper it appears so. B-trees also require sorted data.
B-trees don't need sorted data. The B-tree insertion algorithm performs the sorting for you.
B-trees need sortable data; GP is not saying that the data has to be pre-sorted, but that it can't be unorderable or otherwise uncomparable.
Isn’t sorting required for range based indexes?
Not necessarily. If you have indexing structures for data types that do not have a total order, only a partial order, you can store and do an indexed range search on data types that do have a total order. The primary implication is that the output of the range search will not reflect the total order in the way it would for a traditional B+Tree.
Same question. Can you give me an example of an indexing scheme that works on non-sortable data? I can’t think of any.
The canonical example is indexing rectangles. They have no total order. It is far from the only example. Any data type where equality and intersection are not equivalent test functions will effectively be non-sortable.

There are many indexing schemes for data with these properties. They focus on topological relationships rather than order relationships.

Do you know of any blog/papers which talks about this - using topology for such interval data types.