| This article has a mistake. I actually ran the benchmark, and it doesn't return a valid result on arm64 at all. The posted numbers match mine if I run it under Rosetta. Perhaps the author has been running their entire terminal in Rosetta and forgot.

As I write this comment, the article's numbers are minify: 4.5 GB/s and validate: 5.4 GB/s. These almost exactly match my numbers under Rosetta (M1 Air, no system load):

% rm -f benchmark && make && file benchmark && ./benchmark
c++ -O3 -o benchmark benchmark.cpp simdjson.cpp -std=c++11
benchmark: Mach-O 64-bit executable arm64
minify : 1.02483 GB/s
validate: inf GB/s
% rm -f benchmark && arch -x86_64 make && file benchmark && ./benchmark
c++ -O3 -o benchmark benchmark.cpp simdjson.cpp -std=c++11
benchmark: Mach-O 64-bit executable x86_64
minify : 4.44489 GB/s
validate: 5.3981 GB/s
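
For anyone who wants to rule this out in their own setup: a process can ask macOS whether it is being translated, using the `sysctl.proc_translated` key that Apple documents for Rosetta. Below is a minimal sketch, not part of the benchmark. The function name `process_is_translated` is my own, and on non-Apple platforms it simply reports "unknown".

```cpp
#include <cstddef>
#if defined(__APPLE__)
#include <cerrno>
#include <sys/types.h>
#include <sys/sysctl.h>
#endif

// Hypothetical helper (not from the article or from simdjson).
// Returns 1 if the current process runs under Rosetta translation,
// 0 if it runs natively, and -1 if this cannot be determined
// (which includes every non-Apple platform).
int process_is_translated() {
#if defined(__APPLE__)
    int ret = 0;
    size_t size = sizeof(ret);
    if (sysctlbyname("sysctl.proc_translated", &ret, &size, nullptr, 0) == -1) {
        // ENOENT: the key does not exist on this macOS version, so treat as native.
        return errno == ENOENT ? 0 : -1;
    }
    return ret;
#else
    return -1;
#endif
}
```

Printing this value at the start of the benchmark would make it obvious when an x86_64 binary is being run on an M1.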
Maybe this article is a testament to Rosetta instead, which is churning out numbers reasonable enough that you don't suspect it's running under an emulator.

Update: I re-ran with the improvements from downthread (credit messe and tedd4u):

% rm -f benchmark && make && file benchmark && ./benchmark
c++ -Oz -o benchmark benchmark.cpp simdjson.cpp -std=c++11 -DSIMDJSON_IMPLEMENTATION_ARM64=1
benchmark: Mach-O 64-bit executable arm64
minify : 6.7234 GB/s
validate: 17.7723 GB/s
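
A side note on the `inf GB/s` from the first run: that is what a bytes-divided-by-time calculation prints when the measured elapsed time comes out as zero. I am assuming, not asserting, that this is what happened in the original benchmark. Here is a rough sketch of a throughput measurement that reads a nanosecond clock and refuses to divide by zero. The names `now_ns` and `gigabytes_per_second` are made up for illustration, and the non-Apple branch falls back to `std::chrono::steady_clock`.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#if defined(__APPLE__)
#include <time.h>
#else
#include <chrono>
#endif

// Hypothetical helper: monotonic time in nanoseconds.
uint64_t now_ns() {
#if defined(__APPLE__)
    // macOS-specific call; CLOCK_UPTIME_RAW does not advance while asleep.
    return clock_gettime_nsec_np(CLOCK_UPTIME_RAW);
#else
    using namespace std::chrono;
    return static_cast<uint64_t>(
        duration_cast<nanoseconds>(steady_clock::now().time_since_epoch()).count());
#endif
}

// Hypothetical helper: bytes per nanosecond is numerically equal to GB/s
// (1e9 bytes per 1e9 ns). Returns NaN instead of inf when no time elapsed,
// so a broken measurement is not mistaken for a very fast one.
double gigabytes_per_second(size_t bytes, uint64_t elapsed_ns) {
    if (elapsed_ns == 0) {
        return std::numeric_limits<double>::quiet_NaN();
    }
    return static_cast<double>(bytes) / static_cast<double>(elapsed_ns);
}
```

Typical use is to take `now_ns()` before and after the loop being timed and pass the difference, together with the total bytes processed, to `gigabytes_per_second`.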
Note that my version also uses a nanosecond-precision timer, `clock_gettime_nsec_np(CLOCK_UPTIME_RAW)`, because I was trying to debug the earlier broken version.

That puts Intel at 1.16x and 1.07x for this specific test, not the 1.8x and 3.5x claimed in the article. Also, I took a quick glance at the generated NEON for validateUtf8 and it doesn't look very well interleaved for four execution units. I bet there's still M1 perf on the table here. |