by kristoffer 2337 days ago
Well, on x86, for example, you have the "rep stos" instruction, which pretty much amounts to memset in HW.

The operation needs to be blocking; otherwise you could, of course, do it without the CPU at all (DMA). But you still have to go through the memory controller and actually write each word of memory.
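
For anyone unfamiliar with "rep stos", here is a rough Python model of what the byte variant (rep stosb) does. It is only a sketch of the semantics, assuming the direction flag is clear; the function name and the bytearray standing in for memory are made up for illustration, and real hardware is free to store much more than one byte per step.

```python
def rep_stosb(memory, rdi, rcx, al):
    """Toy model of x86 REP STOSB with the direction flag clear.

    memory : bytearray standing in for RAM (an assumption of this sketch)
    rdi    : destination index, rcx : repeat count, al : byte to store
    Returns the final (rdi, rcx) pair, as the registers would hold them.
    """
    while rcx != 0:
        memory[rdi] = al & 0xFF  # STOSB: store AL at [RDI]
        rdi += 1                 # DF = 0, so the pointer moves upward
        rcx -= 1                 # REP: count down until RCX reaches zero
    return rdi, rcx


# Zero 8 bytes starting at offset 4, which is what memset(buf + 4, 0, 8) would do.
buf = bytearray(b"\xAA" * 16)
end_rdi, end_rcx = rep_stosb(buf, rdi=4, rcx=8, al=0)
```

Note that every byte is still written individually in this model, which matches the point above: the stores still have to reach memory one way or another.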


Adding the ability to zero out an arbitrary range directly in the DRAM chip is a very simple addition HW-wise.

In fact, setting a range to all ones is not hard either. There have been some experiments with giving RAM the ability to perform simple page-level computations, with promising results.

Thing is, you still need to zero out any cache line that might be caching that memory, which would conflict with the CPU accessing the cache. Might as well just let the CPU do the zeroing.
Invalidating cache lines is not that hard; it happens all the time. Refilling them from DRAM is what takes time.
Do you have a link where I can read more?

I also wonder, for typical workloads, what percent of CPU time is spent zeroing pages.

https://parallel.princeton.edu/papers/micro19-gao.pdf

The amount of time wasted zeroing out memory pages in a typical OS is quite significant. Also take into account that such an operation trashes perfectly good cache space for no good reason.

Why can't memory modules be smarter? Or wouldn't it make a difference? That is, is the CPU-to-memory bus just as fast as any action performed exclusively inside the memory module itself could be?
You can only address one row/column at a time in normal DDR memory. Start by reading this if you want to learn more: https://people.freebsd.org/~lstewart/articles/cpumemory.pdf

In the end memory design is limited by the laws of physics.
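
To make the "one row/column at a time" point concrete, here is a deliberately simplified Python model of a single DRAM bank. The class name, the geometry, and the counters are all assumptions for illustration; real DDR adds multiple banks, burst transfers, and timing constraints that this sketch ignores.

```python
class DramBank:
    """Toy single-bank DRAM: only one row can be open at any moment."""

    def __init__(self, rows, cols):
        self.cols = cols
        # Start with 0xFF everywhere so zeroed cells are easy to spot.
        self.data = [[0xFF] * cols for _ in range(rows)]
        self.open_row = None      # row currently held in the row buffer
        self.activations = 0      # how many times a row had to be opened
        self.column_writes = 0    # how many individual column writes occurred

    def write(self, addr, value):
        row, col = divmod(addr, self.cols)
        if self.open_row != row:  # a different row must be activated first
            self.open_row = row
            self.activations += 1
        self.data[row][col] = value
        self.column_writes += 1


def zero_range(bank, start, stop):
    """Zero addresses [start, stop) the ordinary way, one column at a time."""
    for addr in range(start, stop):
        bank.write(addr, 0)


bank = DramBank(rows=8, cols=16)
zero_range(bank, 0, 64)  # spans 4 rows of 16 columns each
```

In this model, zeroing 64 cells costs 4 row activations plus 64 column writes. An in-module zeroing command would, at best, remove the column writes; each affected row would still have to be touched.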

Ah, I never knew that was how it worked. Thanks! Still, it seems like there could be a way to redesign it so that zeroing large blocks is done completely inside the module, with an extra line provided for it, and is faster. With the naive (and probably wrong) implementation I'm thinking of, you'd get O(log N) zeroing but a bigger constant multiplier on lookup, so that would be the wrong compromise in just about every situation. Maybe you could generalize it so that it handles various operations beyond just zeroing, and that might be interesting? Though I'm sure this topic is already well explored, and that exploration has led to what we have now.
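
One possible reading of that "log N zeroing, slower lookup" idea, sketched in software: keep a binary tree of "this whole block is zero" flags over the cells. This is purely a hypothetical illustration (the class and method names are invented, and nothing here claims real memory works this way). Zeroing a range sets a logarithmic number of flags, while every read has to check flags on its way down to the cell.

```python
class LazyZeroMemory:
    """Hypothetical memory with a tree of 'block is all zero' flags."""

    def __init__(self, size):
        if size <= 0 or size & (size - 1):
            raise ValueError("size must be a power of two")
        self.size = size
        self.cells = [0] * size
        # Heap layout: node 1 is the root, node size + i is the leaf for cell i.
        self.zeroed = [False] * (2 * size)
        self.flags_set = 0  # counts flag writes made by zero_range

    def zero_range(self, start, stop):
        """Zero cells [start, stop) by flagging O(log size) tree nodes."""
        self._zero(1, 0, self.size, start, stop)

    def _zero(self, node, lo, hi, start, stop):
        if start >= hi or stop <= lo or self.zeroed[node]:
            return                       # disjoint, or already known to be zero
        if start <= lo and hi <= stop:
            self.zeroed[node] = True     # one flag covers the whole block
            self.flags_set += 1
            return
        mid = (lo + hi) // 2
        self._zero(2 * node, lo, mid, start, stop)
        self._zero(2 * node + 1, mid, hi, start, stop)

    def read(self, addr):
        """Walk from the root; any set flag on the path means the cell is zero."""
        node, lo, hi = 1, 0, self.size
        while True:
            if self.zeroed[node]:
                return 0
            if hi - lo == 1:
                return self.cells[lo]
            mid = (lo + hi) // 2
            if addr < mid:
                node, hi = 2 * node, mid
            else:
                node, lo = 2 * node + 1, mid

    def write(self, addr, value):
        """Push flags down the path so the sibling blocks stay zero, then store."""
        node, lo, hi = 1, 0, self.size
        while hi - lo > 1:
            if self.zeroed[node]:
                self.zeroed[node] = False
                self.zeroed[2 * node] = True
                self.zeroed[2 * node + 1] = True
            mid = (lo + hi) // 2
            if addr < mid:
                node, hi = 2 * node, mid
            else:
                node, lo = 2 * node + 1, mid
        self.zeroed[node] = False
        self.cells[lo] = value


mem = LazyZeroMemory(1024)
mem.write(5, 7)
mem.zero_range(0, 1024)  # sets a single flag at the root
```

The trade-off shows up directly: zero_range touches only a handful of flags, but read and write now walk a path of about log2(size) nodes instead of going straight to the cell, which is the "wrong compromise" for ordinary access patterns.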