| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by gvinciguerra 2011 days ago

Hello everyone. I'm Giorgio, the co-author of the PGM-index paper together with Paolo Ferragina.

First of all, I'd like to thank @hbrundage for sharing our work here and also all those interested in it. I'll do my best to answer any doubt in this thread.

Also, I'd like to mention two other related papers:

- "Why are learned indexes so effective?" presented at ICML 20, and co-authored with Paolo Ferragina and Fabrizio Lillo.

PDF, slides and video: http://pages.di.unipi.it/vinciguerra/publication/learned-ind...

TL;DR: In the VLDB 20 paper, we proved a (rather pessimistic) statement that "the PGM-index has the same worst-case query and space bounds of B-trees". Here, we show that actually, under some general assumptions on the input data, the PGM-index improves the space bounds of B-trees from O(n/B) to O(n/B^2) with high probability, where B is the disk page size.

- "A 'learned' approach to quicken and compress rank/select dictionaries" presented at ALENEX 21, and co-authored with Antonio Boffa and Paolo Ferragina.

PDF and code: http://pages.di.unipi.it/vinciguerra/publication/learned-ran...

TL;DR: You can use piecewise linear approximations to compress not only the index but the data too! We present a compressed bitvector/container supporting efficient rank and select queries, which is competitive with several well-established implementations of succinct data structures.

4 comments

magicalhippo 2011 days ago

I enjoyed glossing over the paper (will read it properly after work), but it was not immediately obvious to me how to implement this for strings.

I'm no expert in this area, so this might have an obvious answer.

I mean I guess you could treat characters as base 2^32 or something like that and convert a string to a real that way, but often strings have a non-trivial sort orders.

link

gvinciguerra 2011 days ago

You are right, variable-length strings are difficult. You could try to pack as many characters as possible in a computer word (or in a big int data type), say P characters, and then use the PGM-index to find the strings that share a prefix of P chars with the given query string. I discussed this solution in a GitHub issue (https://github.com/gvinciguerra/PGM-index/issues/8#issuecomm...). It may work in practice, but it's far from being adequate if compared to trie data structures (and their recent advancements).

link

Gh0stRAT 2011 days ago

Hi Giorgio, thanks for sharing!

You mentioned that tries have had recent advancements, could you please point me to a paper about these advancements so I can learn more? I did a quick Google search but wasn't able to find anything that seemed relevant.

link

gvinciguerra 2011 days ago

Hi @Gh0stRAT, you are very welcome! For prefix search on strings, I recommend the classic String B-tree paper (https://dl.acm.org/doi/10.1145/301970.301973). Among recent results, there's the c-trie++ paper (https://arxiv.org/pdf/1904.07467.pdf) and the papers mentioned in their Related Work section.

link

gpderetta 2011 days ago

If you do not need to do similarity searches, indexing over the (fixed size and numeric, possibly cryptographic) hash of the string might work.

edit: but that will map to an uniform range, making interpolation trivial. As this solution can be used for any type, I'm probably missing something. The index discussed in the paper can probably be used for range queries (which hashing would prevent), not just punctual searches.

edit: bah, I'm just reinventing an hash map.

link

ComodoHacker 2011 days ago

Can PGM index and linear approximation models in general be applied to clustered indexes, where actual data of variable size are stored in the index along with the keys?

link

gvinciguerra 2011 days ago

Yep, in that case you could use an indirection vector containing, for each key k, the offset to the first byte of k. This is what is typically done in B-trees, where the indirection vector is stored in the header of a disk page. It's described for example in Section 3.3 "Variable-length records" of Goetz Graefe's "Modern B-Tree Techniques".

link

BenoitP 2011 days ago

Thank you for your work!

Are there current efforts in your research going in mainstream RDBMS (say postgres)?

The space improvements are so great columns could just be indexed by default.

link

gvinciguerra 2011 days ago

Thank you so much for your interest, BenoitP!

Right now I'm focusing more on the design of compressed data structures. RDBMS are complex systems, and gaining sufficient knowledge of their internals would require several months of work. Though, it would wonderful for me to collaborate with some RDBMS engineers to integrate my current research efforts in their system.

Actually, some time ago, we asked a bachelor's student at the University of Pisa to integrate the PGM-index in Redis (which is simpler than an RDBMS). He did it, and the results were really promising, -3x overall memory usage with respect to Redis ZSETs.

link

virattara 2011 days ago

Does Savitch's theorem apply here? If yes then you can have a strict bound in section 7.1.

link

virattara 2010 days ago

Did I say something wrong?

link

gvinciguerra 2010 days ago

Not at all @virattara ;)

I don't have an answer right now, I just need to think about it more deeply

link

virattara 2010 days ago

ok ok

link