Disclaimer: The title has been changed to "Rust std fs slower than Python!? No, it's hardware!" to avoid clickbait. However, I'm not able to fix the title on HN.
AMD's implementation of the `rep movsb` instruction is surprisingly slow when the addresses are page-aligned. Python's allocator happens to add a 16-byte offset, which avoids the hardware quirk/bug.
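To make the alignment difference concrete, here is a minimal sketch that builds a page-aligned destination buffer and the same buffer shifted by 16 bytes, then times a copy into each. It rests on several assumptions: a 4096-byte page size, a hypothetical 64 KiB copy length, and the hypothetical helper names `PageAligned`, `page_offset`, and `time_copy`. Note that `copy_from_slice` lowers to `memcpy`, which may or may not use `rep movsb` depending on the libc and the copy size, so this only shows how to set up the two alignments. It is not guaranteed to reproduce the slowdown.

```rust
use std::alloc::{alloc_zeroed, dealloc, Layout};
use std::time::{Duration, Instant};

const PAGE: usize = 4096; // assumed page size
const LEN: usize = 64 * 1024; // hypothetical copy length

/// Offset of a pointer within its (assumed 4 KiB) page.
fn page_offset(ptr: *const u8) -> usize {
    (ptr as usize) % PAGE
}

/// Heap buffer whose first byte sits on a page boundary.
struct PageAligned {
    ptr: *mut u8,
    layout: Layout,
}

impl PageAligned {
    fn new(len: usize) -> Self {
        assert!(len > 0, "zero-sized allocations are not supported here");
        let layout = Layout::from_size_align(len, PAGE).expect("valid layout");
        // SAFETY: the layout has a non-zero size.
        let ptr = unsafe { alloc_zeroed(layout) };
        assert!(!ptr.is_null(), "allocation failed");
        Self { ptr, layout }
    }

    /// Mutable view starting `offset` bytes past the page boundary.
    fn slice_mut(&mut self, offset: usize, len: usize) -> &mut [u8] {
        assert!(offset + len <= self.layout.size());
        // SAFETY: the range is in bounds and the memory is zero-initialized.
        unsafe { std::slice::from_raw_parts_mut(self.ptr.add(offset), len) }
    }
}

impl Drop for PageAligned {
    fn drop(&mut self) {
        // SAFETY: `ptr` was allocated with exactly this layout.
        unsafe { dealloc(self.ptr, self.layout) };
    }
}

/// Copy `src` into `dst` repeatedly and return the total elapsed time.
fn time_copy(dst: &mut [u8], src: &[u8], rounds: u32) -> Duration {
    let start = Instant::now();
    for _ in 0..rounds {
        dst.copy_from_slice(src);
        // Keep the optimizer from discarding the copy.
        std::hint::black_box(&mut *dst);
    }
    start.elapsed()
}

fn main() {
    let src = vec![1u8; LEN];
    let mut buf = PageAligned::new(LEN + 16);
    for offset in [0usize, 16] {
        let dst = buf.slice_mut(offset, LEN);
        let within_page = page_offset(dst.as_ptr());
        let elapsed = time_copy(dst, &src, 10_000);
        println!("offset within page = {within_page:2}: {elapsed:?}");
    }
}
```

Any difference between the two timings depends on the CPU, microcode version, and libc in use, so treat the numbers as a local observation only.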
FSRM (Fast Short REP MOV) is a CPU feature embedded in the microcode (in this instance, amd-ucode) that software such as glibc cannot interact with. I refer to the problem as hardware because I consider microcode a part of the hardware.
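Although software cannot change how FSRM behaves, it can at least ask whether the CPU advertises it. The sketch below reads CPUID leaf 7, subleaf 0, where bit 4 of EDX is the FSRM flag. The helper names `edx_reports_fsrm` and `cpu_reports_fsrm` are hypothetical, and the check only tells you the feature is reported, not whether a given microcode revision is affected by the slowdown.

```rust
/// Bit 4 of EDX from CPUID leaf 7, subleaf 0 is the FSRM flag.
const FSRM_BIT: u32 = 1 << 4;

/// Pure helper: does this EDX value have the FSRM bit set?
fn edx_reports_fsrm(edx: u32) -> bool {
    edx & FSRM_BIT != 0
}

/// Returns Some(flag) on x86_64, or None on other architectures.
#[cfg(target_arch = "x86_64")]
#[allow(unused_unsafe)]
fn cpu_reports_fsrm() -> Option<bool> {
    use std::arch::x86_64::{__cpuid, __cpuid_count};
    // SAFETY: the CPUID instruction is always available on x86_64.
    unsafe {
        // Leaf 0 returns the highest supported basic leaf in EAX.
        if __cpuid(0).eax < 7 {
            return Some(false);
        }
        Some(edx_reports_fsrm(__cpuid_count(7, 0).edx))
    }
}

#[cfg(not(target_arch = "x86_64"))]
fn cpu_reports_fsrm() -> Option<bool> {
    None
}

fn main() {
    match cpu_reports_fsrm() {
        Some(true) => println!("CPU reports FSRM"),
        Some(false) => println!("CPU does not report FSRM"),
        None => println!("not an x86_64 target; FSRM does not apply"),
    }
}
```

On Linux, the same flag appears as `fsrm` in the `flags` line of `/proc/cpuinfo`.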