AMD's implementation of the `rep movsb` instruction is surprisingly slow when the addresses are page-aligned. Python's allocator happens to add a 16-byte offset, which avoids this hardware quirk (or bug).
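To make the alignment point concrete, here is a minimal sketch of the address arithmetic only. It does not reproduce the slowdown. The `base` address and the `HEADER` constant are hypothetical values chosen for illustration; the 16-byte figure is the offset mentioned above.

```python
import mmap

# Page size reported by the OS (commonly 4096 bytes on x86-64 Linux).
PAGE_SIZE = mmap.PAGESIZE


def page_offset(addr: int, page_size: int = PAGE_SIZE) -> int:
    """Return how many bytes `addr` lies past the start of its page."""
    return addr % page_size


def is_page_aligned(addr: int, page_size: int = PAGE_SIZE) -> bool:
    """True when `addr` sits exactly on a page boundary."""
    return page_offset(addr, page_size) == 0


# Hypothetical page-aligned address, like one a large mmap-backed
# allocation might start at.
base = 0x7F0000000000

# Assumed 16-byte offset in front of the data, as described above.
HEADER = 16
payload = base + HEADER

print(is_page_aligned(base))     # the raw block starts on a page boundary
print(is_page_aligned(payload))  # the shifted data pointer does not
print(page_offset(payload))      # distance past the page boundary
```

With the offset applied, the pointer actually handed to the copy routine is no longer page-aligned, which would be why the slow path is not hit.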
FSRM (Fast Short REP MOV) is a CPU feature implemented in microcode (in this case, amd-ucode) that software such as glibc cannot interact with. I call it a hardware issue because I consider microcode part of the hardware.