| This is bad benchmarking. There is no way you're doing a b-tree lookup in microseconds on an on-disk file... Unless the parts you care about are already cached. So either the whole file fits in RAM and you pre-load it (in which case you have to account for that memory usage), or you have to run benchmarks on random hashes, which would yield much slower numbers (on the order of 30ms for an HDD). Personally, when I implemented this in a web service, I used a bloom filter. It has some false positives (tunable) and requires a few extra disk reads per check, but the resulting file is also smaller and the code to generate it and check it is very, very simple. https://gist.github.com/marcan/23e1ec416bf884dcd7f0e635ce5f2... P.S. if you need to sort a huge file, just literally use the UNIX/Linux `sort` command. No, it does not load it all into RAM. It knows how to do chunked sorts, dump temp files into /tmp, and then merge them. Old school UNIX tools are smarter than you think. |
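For anyone who hasn't seen one, here is a minimal sketch of the bloom filter idea described above. It is not the code from the linked gist: the `BloomFilter` class, its parameter names, and the SHA-256 double-hashing scheme are all assumptions made for illustration, and the bit array lives in an in-memory `bytearray` rather than a file on disk. With an on-disk file, each of the `num_hashes` bit positions would cost roughly one small read, which is where the "few extra disk reads per check" come from.

```python
import hashlib
import math


class BloomFilter:
    """Illustrative Bloom filter; the bit array is in memory here, not on disk."""

    def __init__(self, expected_items: int, false_positive_rate: float):
        # Standard sizing formulas: m = -n * ln(p) / (ln 2)^2 and k = (m / n) * ln 2.
        self.num_bits = max(
            8,
            math.ceil(-expected_items * math.log(false_positive_rate) / math.log(2) ** 2),
        )
        self.num_hashes = max(1, round(self.num_bits / expected_items * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item: bytes):
        # Double hashing: derive k positions from two 64-bit slices of one digest.
        digest = hashlib.sha256(item).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd so the step is never zero
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: bytes) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: bytes) -> bool:
        # True means "possibly present"; False means "definitely absent".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )
```

Usage is just `bf.add(b"...")` followed by `b"..." in bf`. The false positive rate is the tunable part: lowering it makes the filter larger and adds more bit positions to check per lookup.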
This so much.
I’ve worked with many devs and admins who don’t understand the tools they have at their disposal on their systems. They end up trying to reinvent the wheel, and their solutions usually don’t consider all the edge cases.
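To make the point about `sort` concrete, here is a rough sketch of the chunk, spill, and merge approach the parent comment describes. This is not how `sort` itself is implemented; `external_sort` and `chunk_size` are hypothetical names, and the sketch assumes every input line ends with a newline and that comparing strings by code point is acceptable (the real tool is locale-aware).

```python
import heapq
import itertools
import tempfile


def external_sort(lines, chunk_size=100_000):
    """Yield lines in sorted order, sorting only chunk_size lines in memory at a time."""
    runs = []
    it = iter(lines)
    while True:
        # Sort one chunk in memory, then spill it to a temporary file as a sorted run.
        chunk = sorted(itertools.islice(it, chunk_size))
        if not chunk:
            break
        run = tempfile.TemporaryFile(mode="w+")
        run.writelines(chunk)
        run.seek(0)
        runs.append(run)
    try:
        # heapq.merge reads the sorted runs lazily, roughly one line per run at a time.
        yield from heapq.merge(*runs)
    finally:
        for run in runs:
            run.close()
```

Even this toy version skips things the real tool already handles, such as merging in batches so it does not run out of file descriptors, or letting you pick the buffer size and temp directory (`-S` and `-T` in GNU `sort`). That is exactly the kind of edge case a reinvented wheel tends to miss.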