| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by ocfnash 2306 days ago

I love this write up, and while the solutions discussed are excellent, I think the general-purpose FM Index data structure might work even better. I confess I'd have to read this post more closely to be sure, but I find the FM Index data structure so appealing, I'm always looking for excuses to promote it!

The FM Index is ideally suited to the problem of repeatedly searching a large fixed corpus for many different short substrings, and achieves optimal time complexity: linear in the length of the substring, with excellent constants independent of the length of the corpus (!).

Some years ago undertook a very similar exercise to that of the author except using the leaked Adobe password data rather than the HIBP data, and found the FM Index worked well: http://olivernash.org/2014/01/03/dna-of-a-password-disaster/...