Hacker News new | ask | show | jobs
by magicalhippo 1965 days ago
I enjoyed glossing over the paper (will read it properly after work), but it was not immediately obvious to me how to implement this for strings.

I'm no expert in this area, so this might have an obvious answer.

I mean I guess you could treat characters as base 2^32 or something like that and convert a string to a real that way, but often strings have a non-trivial sort orders.

2 comments

You are right, variable-length strings are difficult. You could try to pack as many characters as possible in a computer word (or in a big int data type), say P characters, and then use the PGM-index to find the strings that share a prefix of P chars with the given query string. I discussed this solution in a GitHub issue (https://github.com/gvinciguerra/PGM-index/issues/8#issuecomm...). It may work in practice, but it's far from being adequate if compared to trie data structures (and their recent advancements).
Hi Giorgio, thanks for sharing!

You mentioned that tries have had recent advancements, could you please point me to a paper about these advancements so I can learn more? I did a quick Google search but wasn't able to find anything that seemed relevant.

Hi @Gh0stRAT, you are very welcome! For prefix search on strings, I recommend the classic String B-tree paper (https://dl.acm.org/doi/10.1145/301970.301973). Among recent results, there's the c-trie++ paper (https://arxiv.org/pdf/1904.07467.pdf) and the papers mentioned in their Related Work section.
If you do not need to do similarity searches, indexing over the (fixed size and numeric, possibly cryptographic) hash of the string might work.

edit: but that will map to an uniform range, making interpolation trivial. As this solution can be used for any type, I'm probably missing something. The index discussed in the paper can probably be used for range queries (which hashing would prevent), not just punctual searches.

edit: bah, I'm just reinventing an hash map.