Hacker News new | ask | show | jobs
by ridiculous_fish 2421 days ago
In practice one should typically prefer a suffix array to a suffix tree. A suffix array is conceptually so easy!

1. Given a set of strings, put all the suffixes into a set. For example "derp" -> "derp", "erp", "rp", "p".

2. Sort it

It's cheap! The number of suffixes equals the string length, and suffixes can be mere references into the string input. So in practice you use maybe ~2x memory compared to the input - and in big cache friendly chunks (not tries).

Now you're flying. If you want to perform a substring search:

1. Binary search on the first character. You get a range of suffixes which begin with that character.

2. Within that range, perform a binary search on the second character. Now you have a range of suffixes which begin with the first two characters.

3. etc.

Powerful and easy.

By the way this technique of going character by character can and should be extended to the sort itself; look up "three-way radix quicksort" for what you never learned in school.

1 comments

More like 4x or 8x the memory (of an ASCII string) if you just use pointers or size_t’s. Abstractly, you need at minimum (n log n) memory to represent a permutation, which is what you’re doing.