| > The raw array solution doesn't work if you can't fit the entire thing in memory. Of course it works. You don't have to literally load the whole file into memory as an array. That's not how file I/O works! Just fseek() to the desired multiple-of-20 address and read 20 bytes. > You could try mmap'ing the file You'd only try that if you haven't read the documentation for mmap, just like a bunch of Rust programmers did. They started on 64-bit machines and never noticed that mmap is limited to a ~256MB window on most 32-bit architectures. Certainly less than the 11GB needed for this problem! > There's nothing over engineered about a B-Tree. Yes there is. This guy hand-rolled 400+ lines of low-level data structure manipulation code just for the B-Tree manipulation. How much do you want to bet that he's got zero memory safety issues or other unsafety in that thing? Would you link this to your web app? Really, a C++ library? > That gives you answers in microseconds vs milliseconds I don't want to call him a liar... but now I have to. At the very least, he's made fatal mistakes in his benchmarking. There is no way that he can do random password lookups from on-disk structures in under 50 microseconds, unless he's got an Intel Optane SSD, and even then I would be highly suspicious. He's either cached the entire data structure in memory (at which point why bother with such a complex solution), or he's used the same set of passwords over and over for testing (which will falsely show a much lower latency than you would get with real passwords). So answer me this: How would this solution work on a typical auto-scale cluster of stateless web server VMs? Do you replicate 5-12GB to every VM every time this database changes? Or do you put one copy on a network share? Congratulations, you now have 1ms latency minimum! That fancy microsecond optimisation has just gone out the window... To be blunt: If your password-testing is the bottleneck to the point that a 50 microsecond latency is needed, then you're running a service that gets 20,000 password resets per second. At this point you're bigger than Office 365 and G Suite... combined, and you likely have much more important scaling concerns. |
You're right. The benchmarks are bad. I benchmarked the best and the worst case in scope of the original file, so I look up for the first and the last hash.
I totally missed that if I look for the same hash over and over again, I'd end up reading the same B-tree files parts, so they can be easily cached. That's probably the reason it seemed so fast.
No lies here, I just missed this (:
I'll rewrite benchmarks and update the results.