by jasonthorsness
248 days ago
"We achieve 19.8 GB/s prefix sum throughput—1.8x faster than a naive implementation and 2.6x faster than FastPFoR"

"FastPFoR is well-established in both industry and academia. However, on our target platform (Graviton4, SIMDe-compiled) it benchmarks at only ~7.7 GB/s, beneath a naive scalar loop at ~10.8 GB/s."

I thought the first quote contained a typo, but it is correct: on this platform the naive loop (~10.8 GB/s) really does beat the supposedly "better" method (~7.7 GB/s). Another demonstration of why actually benchmarking on the target platform is important!
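For anyone who wants to run this kind of check themselves, here is a minimal sketch of the idea: time a naive scalar prefix sum against an alternative on the machine you actually care about. This is plain Python, not the SIMD code from the article, and the names `prefix_sum_naive`, `prefix_sum_accumulate`, and `benchmark` are made up for illustration, so the absolute numbers will look nothing like the GB/s figures quoted above.

```python
import timeit
from itertools import accumulate


def prefix_sum_naive(values):
    """Inclusive prefix sum using a plain scalar loop."""
    out = []
    total = 0
    for v in values:
        total += v
        out.append(total)
    return out


def prefix_sum_accumulate(values):
    """Same inclusive prefix sum, delegated to itertools.accumulate."""
    return list(accumulate(values))


def benchmark(fn, values, repeats=5):
    """Best-of-`repeats` wall time in seconds for one call of fn(values)."""
    return min(timeit.repeat(lambda: fn(values), number=1, repeat=repeats))


if __name__ == "__main__":
    data = list(range(1_000_000))
    # Check that both variants agree before comparing their speed.
    assert prefix_sum_naive(data) == prefix_sum_accumulate(data)
    for fn in (prefix_sum_naive, prefix_sum_accumulate):
        print(f"{fn.__name__}: {benchmark(fn, data):.4f} s")
```

Which variant wins can depend on the interpreter, CPU, and input size, which is the point: measure on the target rather than assume.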