| Hello everyone. I'm Giorgio, the co-author of the PGM-index paper together with Paolo Ferragina. First of all, I'd like to thank @hbrundage for sharing our work here and also all those interested in it. I'll do my best to answer any doubt in this thread. Also, I'd like to mention two other related papers: - "Why are learned indexes so effective?" presented at ICML 20, and co-authored with Paolo Ferragina and Fabrizio Lillo. PDF, slides and video: http://pages.di.unipi.it/vinciguerra/publication/learned-ind... TL;DR: In the VLDB 20 paper, we proved a (rather pessimistic) statement that "the PGM-index has the same worst-case query and space bounds of B-trees". Here, we show that actually, under some general assumptions on the input data, the PGM-index improves the space bounds of B-trees from O(n/B) to O(n/B^2) with high probability, where B is the disk page size. - "A 'learned' approach to quicken and compress rank/select dictionaries" presented at ALENEX 21, and co-authored with Antonio Boffa and Paolo Ferragina. PDF and code: http://pages.di.unipi.it/vinciguerra/publication/learned-ran... TL;DR: You can use piecewise linear approximations to compress not only the index but the data too! We present a compressed bitvector/container supporting efficient rank and select queries, which is competitive with several well-established implementations of succinct data structures. |
I'm no expert in this area, so this might have an obvious answer.
I mean I guess you could treat characters as base 2^32 or something like that and convert a string to a real that way, but often strings have a non-trivial sort orders.