On platforms that require aligned loads and stores (so neither x86 nor ARM), a direct pointer cast can compile to a single aligned load/store where a memcpy compiles to multiple byte loads/stores, even on a good compiler, since memcpy() doesn't require its pointers to be aligned. This can be mitigated by going through a local variable, but it gets pretty verbose.
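To make the local-variable workaround concrete, here is a rough sketch of one way to read it. The helper name load_float_bits is made up for illustration, and it assumes the source data really is a properly aligned uint32_t and that float is 32 bits wide. The idea is that the load from memory is an ordinary typed load (so the compiler may assume alignment), and the memcpy only ever touches locals whose alignment the compiler already knows.

```c
#include <stdint.h>
#include <string.h>

/* Assumption: float and uint32_t have the same size on this platform. */
_Static_assert(sizeof(float) == sizeof(uint32_t), "float must be 32 bits");

/* Hypothetical helper: reinterpret the bits of an aligned uint32_t as a float. */
static float load_float_bits(const uint32_t *p)
{
    /* Ordinary typed load: the pointer type lets the compiler assume
       natural alignment, so this can be a single word load. */
    uint32_t tmp = *p;

    /* Both operands are locals, so the compiler knows their alignment and
       can typically turn this memcpy into a register move. */
    float f;
    memcpy(&f, &tmp, sizeof f);
    return f;
}
```

On an IEEE-754 platform, calling this on a uint32_t holding 0x3f800000 gives 1.0f. The extra temporary per access is where the verbosity comes from.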
For MSVC you have to add "/Oi"; otherwise memcpy is always a function call at lower optimisation levels. Clang and GCC always treat it as an intrinsic, even in debug builds.