I actually did compare 128-bit SIMD against the 64-bit non-SIMD version, and the difference was 2x. I only published the results for the 4x comparison (against 32-bit), but it should be pretty easy to reproduce if you change the types in the non-SIMD code[1] from i32 to i64.
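
For anyone who wants to try it, here is a rough sketch of the idea. It is a hypothetical stand-in, not the actual code from [1]: the `Elem` alias, `lanes`, and `scalar_sum` are names I made up for illustration. The point is just that a 128-bit register holds four i32 lanes but only two i64 lanes, which lines up with the 4x and 2x numbers.

```rust
// Hypothetical stand-in for a scalar (non-SIMD) kernel; NOT the code from [1].
// Change `Elem` between i32 and i64 to switch the scalar element width.
type Elem = i64;

// Width of the SIMD register being compared against, in bits.
const SIMD_BITS: usize = 128;

/// Number of `Elem` lanes that fit in one 128-bit register:
/// 4 for i32, 2 for i64. This is the ideal-case speedup over scalar code.
const fn lanes() -> usize {
    SIMD_BITS / (std::mem::size_of::<Elem>() * 8)
}

/// Scalar sum, one element per iteration (wrapping on overflow).
fn scalar_sum(data: &[Elem]) -> Elem {
    data.iter().fold(0, |acc, &x| acc.wrapping_add(x))
}
```

Benchmark `scalar_sum` once with `type Elem = i32` and once with `type Elem = i64` against the SIMD version; `lanes()` gives the ratio you would expect in the ideal case.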