Hacker News new | ask | show | jobs
by diamondlovesyou 932 days ago
AMD's string store is not like Intel's. Generally, you don't want to use it until you are past the CPU's L2 size (L3 is a victim cache), making ~2k WAY too small. Once past that point, it's profitable to use string store, and should run at "DRAM speed". But it has a high startup cost, hence 256bit vector loads/stores should be used until that threshold is met.
2 comments

Isn't the high startup cost what FSRM is intended to solve?

> With the new Zen3 CPUs, Fast Short REP MOV (FSRM) is finally added to AMD’s CPU functions analog to Intel’s X86_FEATURE_FSRM. Intel had already introduced this in 2017 with the Ice Lake Client microarchitecture. But now AMD is obviously using this feature to increase the performance of REP MOVSB for short and very short operations. This improvement applies to Intel for string lengths between 1 and 128 bytes and one can assume that AMD’s implementation will look the same for compatibility reasons.

https://www.igorslab.de/en/cracks-on-the-core-3-yet-the-5-gh...

Fast is relative here. These are microcoded instructions, which are generally terrible for latency: microcoded instructions don't get branch prediction benefits, nor OoO benefits (they lock the FE/scheduler while running). Small memcpy/moves are always latency bound, hence even if the HW supports "fast" rep store, you're better off not using them. L2 is wicked fast, and these copies are linear, so prediction will be good.

Note that for rep store to be better it must overcome the cost of the initial latency and then catch up to the 32byte vector copies, which yes generally have not-as-good-perf vs DRAM speed, but they aren't that bad either. Thus for small copies.... just don't use string store.

All this is not even considering non-temporal loads/stores; many larger copies would see better perf by not trashing the L2 cache, since the destination or source is often not inspected right after. String stores don't have a non-temporal option, so this has to be done with vectors.

I'm not sure that your comment is responsive to the original post.

FSRM is fast on Intel, even with single byte strings. AMD claims to support FSRM with recent CPUs but performs poorly on small strings, so code which Just Works on Intel has a performance regression when running on AMD.

Now here you're saying `REP MOVSB` shouldn't be used on AMD with small strings. In that case, AMD CPUs shouldn't advertise FSRM. As long as they're advertising it, it shouldn't perform worse than the alternative.

https://bugs.launchpad.net/ubuntu/+source/glibc/+bug/2030515

https://sourceware.org/bugzilla/show_bug.cgi?id=30994

I'm not a CPU expert so perhaps I'm misinterpreting you and we're talking past each other. If so, please clarify.

Or you leave it as is forcing AMD to fix their shit. "fast string mode" has been strongly hinted as _the_ optimal way over 30 years ago with Pentium Pro, further enforced over 10 years ago with ERMSB and FSRM 4 years ago. AMD get with the program.
rep movsb might have been fast at one point but it definitely was not for a few decades in the middle, where vector stores were the fastest way to implement memcpy. Intel decided that they should probably make it fast again and they have slowly made it competitive with the extensions you’ve mentioned. But for processors that don’t support it, using rep movsb is going to be slow and probably not something you’d want to pick unless you have weird constraints (binary size?)