| HN Mirror


by diamondlovesyou 972 days ago

Fast is relative here. These are microcoded instructions, which are generally terrible for latency: microcoded instructions get neither branch prediction benefits nor out-of-order (OoO) benefits, because they lock the front end (FE) and scheduler while running. Small memcpy/moves are always latency bound, so even if the HW supports "fast" rep store, you're better off not using it for them. L2 is wicked fast, and these copies are linear, so prediction will be good.

Note that for rep store to be better, it must first overcome the cost of its initial latency and then catch up to the 32-byte vector copies, which, yes, generally have not-as-good perf vs. DRAM speed, but they aren't that bad either. So for small copies, just don't use string stores.
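To make the trade-off concrete, here is a minimal sketch of a size-based dispatch: vector copies below a cutoff, `rep movsb` above it. Everything named here is an assumption for illustration: `copy_dispatch`, `copy_vector`, `copy_rep_movsb`, and especially the `REP_MOVSB_MIN_BYTES` value, which is hypothetical and would have to be measured per CPU. It uses 16-byte SSE2 copies rather than 32-byte AVX ones so that it builds without extra compiler flags, and it assumes GCC or Clang on x86-64 (other targets fall back to `memcpy`).

```c
#include <stddef.h>
#include <string.h>

#if defined(__x86_64__) && defined(__GNUC__)
#include <emmintrin.h>  /* SSE2, baseline on x86-64 */
#define X86_COPY_PATHS 1
#else
#define X86_COPY_PATHS 0
#endif

/* Hypothetical cutoff: the real crossover is CPU-specific and must be measured. */
#define REP_MOVSB_MIN_BYTES 2048

/* Small/medium path: unaligned 16-byte vector loads and stores. */
static void copy_vector(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
#if X86_COPY_PATHS
    while (n >= 16) {
        _mm_storeu_si128((__m128i *)d, _mm_loadu_si128((const __m128i *)s));
        d += 16; s += 16; n -= 16;
    }
#endif
    memcpy(d, s, n);  /* tail bytes (or the whole copy on non-x86 builds) */
}

/* Large path: the microcoded string move, paying its startup latency once. */
static void copy_rep_movsb(void *dst, const void *src, size_t n)
{
#if X86_COPY_PATHS
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n)
                     :
                     : "memory");
#else
    memcpy(dst, src, n);
#endif
}

/* Non-overlapping copy that only reaches for rep movsb above the cutoff. */
static void *copy_dispatch(void *dst, const void *src, size_t n)
{
    if (n < REP_MOVSB_MIN_BYTES)
        copy_vector(dst, src, n);
    else
        copy_rep_movsb(dst, src, n);
    return dst;
}
```

A real `memcpy` would also special-case very small sizes and handle alignment; the point is only that the string instruction sits behind a size check instead of being used unconditionally.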

All this is not even considering non-temporal loads/stores; many larger copies would see better perf by not trashing the L2 cache, since the destination or source is often not inspected right after. String stores don't have a non-temporal option, so this has to be done with vectors.
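As a rough sketch of what "done with vectors" can look like, the function below copies with non-temporal (streaming) stores so the destination does not displace cache contents. The name `copy_nontemporal` is hypothetical, the sketch covers only the store side (loads are ordinary unaligned loads), and it assumes GCC or Clang on x86-64 with a `memcpy` fallback elsewhere.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__) && defined(__GNUC__)
#include <emmintrin.h>  /* _mm_stream_si128, _mm_sfence */
#define X86_NT_STORES 1
#else
#define X86_NT_STORES 0
#endif

/* Non-overlapping copy whose bulk uses cache-bypassing 16-byte stores. */
static void *copy_nontemporal(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
#if X86_NT_STORES
    /* _mm_stream_si128 needs a 16-byte-aligned destination,
       so copy a short head normally until d is aligned. */
    size_t head = (16 - ((uintptr_t)d & 15)) & 15;
    if (head > n)
        head = n;
    memcpy(d, s, head);
    d += head; s += head; n -= head;

    while (n >= 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)s);
        _mm_stream_si128((__m128i *)d, v);  /* non-temporal store */
        d += 16; s += 16; n -= 16;
    }
    /* Streaming stores are weakly ordered; fence before others read dst. */
    _mm_sfence();
#endif
    memcpy(d, s, n);  /* tail bytes (or the whole copy on non-x86 builds) */
    return dst;
}
```

This only pays off when the destination really isn't read back soon; for data that is consumed immediately, bypassing the cache tends to hurt rather than help.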

1 comment

js2 972 days ago

I'm not sure that your comment is responsive to the original post.

FSRM is fast on Intel, even with single byte strings. AMD claims to support FSRM with recent CPUs but performs poorly on small strings, so code which Just Works on Intel has a performance regression when running on AMD.

Now here you're saying `REP MOVSB` shouldn't be used on AMD with small strings. In that case, AMD CPUs shouldn't advertise FSRM. As long as they're advertising it, it shouldn't perform worse than the alternative.

https://bugs.launchpad.net/ubuntu/+source/glibc/+bug/2030515

https://sourceware.org/bugzilla/show_bug.cgi?id=30994
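For reference, this is roughly how software can see what a CPU advertises. The sketch reads CPUID leaf 7, subleaf 0, where ERMS is EBX bit 9 and FSRM is EDX bit 4. The struct and function names (`rep_movsb_flags`, `query_rep_movsb_flags`) are made up for illustration, and it assumes GCC or Clang's `<cpuid.h>`; on other targets both flags simply read as false.

```c
#include <stdbool.h>

#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
#include <cpuid.h>  /* __get_cpuid_count (GCC/Clang) */
#define HAVE_CPUID_H 1
#else
#define HAVE_CPUID_H 0
#endif

/* Hypothetical container for the two rep-movsb-related feature bits. */
struct rep_movsb_flags {
    bool erms;  /* Enhanced REP MOVSB/STOSB: leaf 7, EBX bit 9 */
    bool fsrm;  /* Fast Short REP MOV:       leaf 7, EDX bit 4 */
};

static struct rep_movsb_flags query_rep_movsb_flags(void)
{
    struct rep_movsb_flags f = { false, false };
#if HAVE_CPUID_H
    unsigned int eax, ebx, ecx, edx;
    /* Returns 0 when leaf 7 is not supported by this CPU. */
    if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) {
        f.erms = (ebx >> 9) & 1u;
        f.fsrm = (edx >> 4) & 1u;
    }
#endif
    return f;
}
```

The bit only says the feature is advertised. It says nothing about how fast small copies actually are on a given part, which is exactly the mismatch being discussed.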

I'm not a CPU expert so perhaps I'm misinterpreting you and we're talking past each other. If so, please clarify.