Why can't memory modules be smarter? Or wouldn't it make a difference? I.e., is the CPU-to-memory bus already as fast as any operation done entirely inside the memory module could be?
Ah, I never knew that was how it worked. Thanks! Still, it seems like the module could be redesigned so that zeroing a large block is done entirely inside the module, with an extra line provided for it, and is faster. With the naive (and probably wrong) implementation I have in mind, you'd get O(log N) zeroing but a bigger constant factor on every lookup, which would be the wrong trade-off in just about every situation. Maybe it could be generalized to support operations beyond zeroing, which might be interesting? Though I'm sure this is already a well-explored topic, and that exploration has led to what we have now.
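For concreteness, here is one guess at what such a naive scheme might look like, modelled in software. Everything in it is hypothetical (the `LazyZeroMemory` name, the timestamp trick, the tree layout), and it is not how real DRAM works. The idea: keep a binary tree over the address space, record "zeroed at time T" on the O(log N) tree nodes that exactly cover a range, and have every read check the cell's ancestors to see whether a zeroing happened after the last write.

```python
class LazyZeroMemory:
    """Toy model: O(log N) range zeroing, paid for with extra work per read."""

    def __init__(self, size):
        # Assumption: size is a power of two, so the tree is perfectly balanced.
        assert size > 0 and size & (size - 1) == 0
        self.size = size
        self.clock = 0                       # logical time, bumped on every write/zero
        self.values = [0] * size             # raw cell contents
        self.written_at = [0] * size         # time of last write per cell
        # Heap-style tree: root is node 1, cell `a` is leaf node `size + a`.
        self.zeroed_at = [0] * (2 * size)    # time of last zeroing per tree node

    def write(self, addr, value):
        self.clock += 1
        self.values[addr] = value
        self.written_at[addr] = self.clock

    def zero_range(self, lo, hi):
        """Logically zero cells [lo, hi); returns how many tree nodes were marked."""
        self.clock += 1
        touched = 0
        left, right = lo + self.size, hi + self.size
        while left < right:
            if left & 1:                     # left is a right child: mark it, step past
                self.zeroed_at[left] = self.clock
                left += 1
                touched += 1
            if right & 1:                    # right bound is exclusive: mark its left sibling
                right -= 1
                self.zeroed_at[right] = self.clock
                touched += 1
            left >>= 1
            right >>= 1
        return touched

    def read(self, addr):
        # Find the most recent zeroing recorded on the leaf or any ancestor.
        node = addr + self.size
        last_zero = 0
        while node >= 1:
            last_zero = max(last_zero, self.zeroed_at[node])
            node >>= 1
        # The stored value only counts if it was written after that zeroing.
        return 0 if last_zero > self.written_at[addr] else self.values[addr]


mem = LazyZeroMemory(1024)
for a in range(1024):
    mem.write(a, a + 1)
marked = mem.zero_range(3, 1000)   # marks a handful of nodes instead of 997 cells
```

In this software model the read loop walks about log N levels. Dedicated hardware could presumably check all levels in parallel, which is roughly where a "bigger constant on every lookup" would come from, but that part is speculation.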
In the end, memory design is limited by the laws of physics.