by ape4 1176 days ago
I believe the docs but I would have thought that memset() would be really quick - implemented in hardware?
4 comments

"Real quick" is human speak. For large amounts of memory it's still bound by the machine's RAM speed, which is much lower (a couple of orders of magnitude, I believe) than, say, cache speed. Things might be different if there were a RAM equivalent of SSD TRIM (making the RAM module zero itself without transferring lots of zeros across the bus), but there isn't.
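If anyone wants to check the RAM-bound vs. cache-bound point on their own machine, here is a minimal sketch. The name `memset_bytes_per_sec` and the buffer sizes are my own picks for illustration; the numbers you get depend entirely on your hardware, so I'm not quoting any.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: zeroes a `len`-byte buffer `reps` times and returns
   the approximate throughput in bytes per second. Returns -1.0 on bad
   arguments, allocation failure, or a run too short for clock() to time. */
double memset_bytes_per_sec(size_t len, int reps)
{
    if (len == 0 || reps <= 0)
        return -1.0;

    unsigned char *buf = malloc(len);
    if (buf == NULL)
        return -1.0;

    /* Calling through a volatile function pointer keeps the compiler from
       eliding or merging the memset calls. */
    void *(*volatile fill)(void *, int, size_t) = memset;

    fill(buf, 1, len);              /* fault the pages in before timing */

    clock_t start = clock();
    for (int r = 0; r < reps; r++)
        fill(buf, 0, len);
    clock_t end = clock();

    free(buf);

    double secs = (double)(end - start) / CLOCKS_PER_SEC;
    if (secs <= 0.0)
        return -1.0;
    return (double)len * (double)reps / secs;
}
```

Comparing something like `memset_bytes_per_sec(16 * 1024, 1000000)` (small enough to sit in cache on most machines) against `memset_bytes_per_sec((size_t)256 << 20, 8)` (far larger than any cache) should show the gap, assuming nothing else is hammering memory at the same time.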
I'm completely unfamiliar with how the CPU communicates with the memory modules, but is there not a way for the CPU to tell the memory modules to zero out a whole range of memory rather than one byte/sector/whatever-the-standard-unit-is at a time?

As I type this, I'm realizing how little I know about the protocol between the CPU and the memory modules--if anyone has an accessible link on the subject, I'd be grateful.

That's what I referred to as "TRIM for RAM". I'm not aware of it being a thing. And I don't know the protocol, but I'm also not sure it's just a matter of protocol. It might require additional circuitry per bit of memory that would increase the cost.
'trim' for RAM is a virtual-to-physical page table hack. Memory that isn't backed by a page just reads as zero; it doesn't need to be initialized. Offhand, a page is supposed to be zeroed before it's handed to a process, but I don't know if there are mechanisms to, e.g., use some spare cycles to proactively zero non-allocated memory that's a candidate for being attached to VM space.
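A small sketch of the "not backed by a page reads as zero" behaviour, assuming Linux (where anonymous mappings are documented as zero-initialized). The function name `count_nonzero_in_fresh_mapping` is made up for illustration. Nothing here calls memset, yet every byte reads back as zero.

```c
#define _DEFAULT_SOURCE   /* for MAP_ANONYMOUS on glibc in strict modes */
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical helper: maps `len` bytes of fresh anonymous memory, counts
   the bytes that are not zero, and unmaps it again.
   Returns the count, or -1 if mmap fails. */
long count_nonzero_in_fresh_mapping(size_t len)
{
    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    long nonzero = 0;
    for (size_t i = 0; i < len; i++)
        nonzero += (p[i] != 0);     /* we only read, never write */

    munmap(p, len);
    return nonzero;
}
```

On Linux, `count_nonzero_in_fresh_mapping(1 << 20)` should return 0. As far as I know, a page that is only read can be served from a shared zero page, and the kernel only has to hand over a private zeroed page once you write to it.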
Some processors have “hardware store elimination” that makes writing all zeros a bit faster than writing other values.

https://travisdowns.github.io/blog/2020/05/13/intel-zero-opt...

No. Memset (and bzero) aren’t HW accelerated. There is a special CPU instruction that can do it, but in practice it’s faster to do it in a loop. In user space you can frequently leverage SIMD instructions to speed it up (of course those aren’t available in the kernel, because it avoids saving/restoring those and the FP registers on every syscall; that only happens when you switch contexts).
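For illustration, a rough sketch of what "do it in a loop" can look like in portable C, using 8-byte stores instead of SIMD. `zero_loop` is a hypothetical name, not anything from libc or a kernel. Real memset implementations use wider stores and size-specific paths, and an optimizing compiler may well turn this loop back into a memset call.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: zeroes `len` bytes at `dst`, using 8-byte stores
   for the aligned middle and byte stores for the unaligned head and tail. */
void zero_loop(void *dst, size_t len)
{
    unsigned char *p = dst;

    /* Head: byte stores until p is 8-byte aligned. */
    while (len > 0 && ((uintptr_t)p % 8) != 0) {
        *p++ = 0;
        len--;
    }

    /* Middle: one 64-bit store per iteration. The fixed-size memcpy is
       normally compiled to a single store and avoids aliasing problems. */
    const uint64_t zero = 0;
    while (len >= 8) {
        memcpy(p, &zero, sizeof zero);
        p += 8;
        len -= 8;
    }

    /* Tail: remaining bytes. */
    while (len > 0) {
        *p++ = 0;
        len--;
    }
}
```

A SIMD version has the same head/middle/tail shape, just with 16-, 32- or 64-byte stores in the middle loop.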

What could be interesting is if there were a CPU instruction to tell the RAM to do it. Then you would avoid the memory bandwidth impact of freeing the memory. But I don’t think there’s any such instruction for the CPU/memory protocol even today. Not sure why.

That seems wild to be honest. I know how easy it is to say "well they can just.."

But...wouldn't it be relatively trivial to have an instruction that tells the memory controller "set range from address y to x to 0" and let it handle it? Actually slamming a bunch of 0's out over the bus seems so very suboptimal.

> But...wouldn't it be relatively trivial to have an instruction that tells the memory controller "set range from address y to x to 0" and let it handle it?

Having the memory controller or memory module do it is complicated somewhat because it needs to be coherent with the caches, needs to obey translation, etc. If you have the memory controller do it, it doesn't save bandwidth. But, on the other hand, with a write back cache, your zeroing may never need to get stored to memory at all.

Further, if you have the module do it, the module/sdram state machine needs to get more complicated... and if you just have one module on the channel, then you don't benefit in bandwidth, either.

A DMA controller can be set up to do it... but in practice this is usually more expensive on big CPUs than just letting a CPU do it.

It's not really tying up a processor, either, because of superscalar execution, hyperthreading, etc.; modern processors have an abundance of resources, and what slows things down is work that must be done serially or resources that are heavily contended (like the bus to memory).

Thanks for the answer!
Though modern CPUs are explicitly built to make sure such a loop is fast.

And on some systems the DRM controller might zero the memory in some situations, in which case you could say it was done by hardware.

> DRM controller

Did you mean DMA controller? Or do you have more information?

yes DMA, not the direct rendering manager ;=)
dc zva?
Really quick still doesn't mean it's free, especially if you always have to zero all the allocated pages even if the process only used part of a page.

Also the question is what is this % in relation to?

Probably that freeing gets up to 5% slower, which is reasonable given that before you could often use idle time to zero many of the pages, or might not have zeroed some of the pages at all (as they were never reused).