|
|
|
|
|
by reubenmorais
2307 days ago
|
|
1. The raw array solution doesn't work if you can't fit the entire thing in memory. You could try mmap'ing the file and letting the kernel do the paging for you, I guess. 2. There's nothing over engineered about a B-Tree. It's a simple, well understood data structure which solves this exact problem. 3. There are meaningful differences in what you can do with a solution that gives you answers in microseconds vs milliseconds, as the blogger mentions. |
|
Of course it works. You don't have to literally load the whole file into memory as an array. That's not how file I/O works! Just fseek() to the desired multiple-of-20 address and read 20 bytes.
> You could try mmap'ing the file
You'd only try that if you haven't read the documentation for mmap, just like a bunch of Rust programmers did. They started on 64-bit machines and never noticed that mmap is limited to a ~256MB window on most 32-bit architectures. Certainly less than the 11GB needed for this problem!
> There's nothing over engineered about a B-Tree.
Yes there is. This guy hand-rolled 400+ lines of low-level data structure manipulation code just for the B-Tree manipulation. How much do you want to bet that he's got zero memory safety issues or other unsafety in that thing?
Would you link this to your web app? Really, a C++ library?
> That gives you answers in microseconds vs milliseconds
I don't want to call him a liar... but now I have to. At the very least, he's made fatal mistakes in his benchmarking.
There is no way that he can do random password lookups from on-disk structures in under 50 microseconds, unless he's got an Intel Optane SSD, and even then I would be highly suspicious.
He's either cached the entire data structure in memory (at which point why bother with such a complex solution), or he's used the same set of passwords over and over for testing (which will falsely show a much lower latency than you would get with real passwords).
So answer me this: How would this solution work on a typical auto-scale cluster of stateless web server VMs? Do you replicate 5-12GB to every VM every time this database changes? Or do you put one copy on a network share? Congratulations, you now have 1ms latency minimum! That fancy microsecond optimisation has just gone out the window...
To be blunt: If your password-testing is the bottleneck to the point that a 50 microsecond latency is needed, then you're running a service that gets 20,000 password resets per second. At this point you're bigger than Office 365 and G Suite... combined, and you likely have much more important scaling concerns.