

	by ape4 1176 days ago
	I believe the docs but I would have thought that memset() would be really quick - implemented in hardware?

4 comments

dataflow 1175 days ago

"Real quick" is human speak. For large amounts of memory it's still bound by RAM speed for a machine, which is much lower (a couple orders of magnitude I believe) than, say, cache speed. Things might be different if there was a RAM equivalent of SSD TRIM (making the RAM module zero itself without transferring lots of zeros across the bus), but there isn't.

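To make the RAM-versus-cache point concrete, here is a rough C sketch that times repeated memset() calls on a buffer of a chosen size. The helper name zero_throughput_gib_s is made up for illustration, and the code assumes a POSIX system with clock_gettime(). The numbers depend entirely on the machine; on typical hardware a buffer that fits in cache tends to zero faster per byte than one much larger than the last-level cache.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Calling through a volatile function pointer keeps the compiler from
   eliding or merging the repeated memset calls. */
static void *(*volatile memset_fn)(void *, int, size_t) = memset;

/* Hypothetical helper: returns the observed zeroing throughput in GiB/s,
   or -1.0 on bad arguments or allocation failure. */
double zero_throughput_gib_s(size_t bytes, int reps)
{
    if (bytes == 0 || reps <= 0)
        return -1.0;

    unsigned char *buf = malloc(bytes);
    if (buf == NULL)
        return -1.0;

    /* Touch every page first so page faults are not part of the timing. */
    memset_fn(buf, 1, bytes);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < reps; i++)
        memset_fn(buf, 0, bytes);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    /* Sanity check that the buffer really was zeroed. */
    assert(buf[0] == 0 && buf[bytes - 1] == 0);
    free(buf);

    double secs = (double)(t1.tv_sec - t0.tv_sec)
                + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
    if (secs <= 0.0)
        secs = 1e-9; /* guard against a timer too coarse to see the loop */

    return ((double)bytes * reps) / secs / (1024.0 * 1024.0 * 1024.0);
}
```

Comparing, say, zero_throughput_gib_s(32 * 1024, 100000) with zero_throughput_gib_s((size_t)1 << 30, 10) should show the gap, though the exact ratio is not something this sketch can promise.
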
throwaway894345 1175 days ago

I'm completely unfamiliar with how the CPU communicates with the memory modules, but is there not a way for the CPU to tell the memory modules to zero out a whole range of memory rather than one byte/sector/whatever-the-standard-unit-is at a time?

As I type this, I'm realizing how little I know about the protocol between the CPU and the memory modules--if anyone has an accessible link on the subject, I'd be grateful.

dataflow 1175 days ago

That's what I referred to as "TRIM for RAM". I'm not aware of it being a thing. And I don't know the protocol, but I'm also not sure it's just a matter of protocol. It might require additional circuitry per bit of memory that would increase the cost.

mjevans 1175 days ago

'Trim' for RAM is a virtual-to-physical page table hack. Memory that isn't backed by a physical page just reads as zero; it doesn't need to be initialized. Offhand, a page is supposed to be zeroed before it's handed to a process, but I don't know if there are, e.g., mechanisms that use spare cycles to proactively zero unallocated memory that's a candidate for being attached to VM space.
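
A small C sketch of what this looks like from user space, assuming a Linux-like system where mmap() supports MAP_ANONYMOUS: a fresh anonymous mapping reads as all zeros even though the program never wrote a zero to it. The helper name fresh_mapping_is_zero is hypothetical, and how lazily the kernel backs the mapping with physical pages is an implementation detail this code does not observe.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical helper: returns 1 if a fresh anonymous mapping reads as all
   zeros, 0 if any byte is nonzero, and -1 if mmap() fails. */
int fresh_mapping_is_zero(size_t bytes)
{
    unsigned char *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    /* The program never stores to the mapping; it only reads it. */
    int all_zero = 1;
    for (size_t i = 0; i < bytes; i++) {
        if (p[i] != 0) {
            all_zero = 0;
            break;
        }
    }

    int rc = munmap(p, bytes);
    assert(rc == 0);
    (void)rc; /* silence unused-variable warnings when NDEBUG is defined */
    return all_zero;
}
```

Calling fresh_mapping_is_zero(1 << 20) checks one MiB this way; the zeros come from the kernel's handling of the mapping, not from a memset() in the process.
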

andrewf 1175 days ago

Oldie but a goodie: https://people.freebsd.org/~lstewart/articles/cpumemory.pdf

MarkSweep 1175 days ago

Some processors have “hardware store elimination” that makes writing all zeros a bit faster than writing other values.

https://travisdowns.github.io/blog/2020/05/13/intel-zero-opt...

vlovich123 1176 days ago

No. Memset (and bzero) aren’t HW accelerated. There is a special CPU instruction that can do it, but in practice it’s faster to do it in a loop. In user space you can frequently leverage SIMD instructions to speed it up (of course those aren’t available in the kernel, because it avoids saving/restoring those and the FP registers on every syscall; it only does so when you switch contexts).

What could be interesting is if there were a CPU instruction to tell the RAM to do it. Then you would avoid the memory bandwidth impact of freeing the memory. But I don’t think there’s any such instruction for the CPU/memory protocol even today. Not sure why.
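
For illustration, a minimal C sketch of the "loop plus SIMD" approach in user space. It is not how any particular libc implements memset(); the name zero_bytes is made up, and the SSE2 path is only compiled when the compiler defines __SSE2__ (otherwise it falls back to a plain byte loop).

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h> /* SSE2 intrinsics, x86 only */
#endif

/* Hypothetical helper, not a libc function: zero n bytes starting at dst. */
void zero_bytes(void *dst, size_t n)
{
    unsigned char *p = dst;
    assert(p != NULL || n == 0);

#if defined(__SSE2__)
    /* Store 16 zero bytes at a time; the unaligned store form means the
       caller does not need to align dst. */
    const __m128i zero = _mm_setzero_si128();
    while (n >= 16) {
        _mm_storeu_si128((__m128i *)p, zero);
        p += 16;
        n -= 16;
    }
#endif

    /* Remaining tail bytes, or the whole range when SSE2 is unavailable. */
    while (n > 0) {
        *p++ = 0;
        n--;
    }
}
```

A real memset() is usually far more elaborate (alignment handling, wider vectors, size-based dispatch), so treat this only as a picture of the basic idea.
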

Arrath 1175 days ago

That seems wild to be honest. I know how easy it is to say "well they can just.."

But...wouldn't it be relatively trivial to have an instruction that tells the memory controller "set range from address y to x to 0" and let it handle it? Actually slamming a bunch of 0's out over the bus seems so very suboptimal.

mlyle 1175 days ago

> But...wouldn't it be relatively trivial to have an instruction that tells the memory controller "set range from address y to x to 0" and let it handle it?

Having the memory controller or memory module do it is complicated somewhat because it needs to be coherent with the caches, needs to obey translation, etc. If you have the memory controller do it, it doesn't save bandwidth. But, on the other hand, with a write back cache, your zeroing may never need to get stored to memory at all.

Further, if you have the module do it, the module/sdram state machine needs to get more complicated... and if you just have one module on the channel, then you don't benefit in bandwidth, either.

A DMA controller can be set up to do it... but in practice this is usually more expensive on big CPUs than just letting a CPU do it.

It's not really tying up a processor, either, because of superscalar execution, hyperthreading, etc.; modern processors have an abundance of resources, and what slows things down is work that must be done serially or resources that are most contended (like the bus to memory).

Arrath 1175 days ago

Thanks for the answer!

dathinab 1175 days ago

Though modern CPUs are explicitly built to make sure such a loop is fast.

And on some systems the DRM controller might zero the memory in some situations, in which case you could say it was done by hardware.

pflanze 1175 days ago

> DRM controller

Did you mean DMA controller? Or do you have more information?

dathinab 1175 days ago

yes DMA, not the direct rendering manager ;=)

saagarjha 1175 days ago

dc zva?

dathinab 1175 days ago

Really quick still doesn't mean it's free, especially if you always have to zero every allocated page even if the process only used part of the page.

Also the question is what is this % in relation to?

Probably that freeing gets up to 5% slower, which is reasonable given that before you could often use idle time to zero many of the pages, or might not have zeroed some of the pages at all (as they were never reused).