Hacker News
by dreamcompiler 1478 days ago
The other issue is instruction-level parallelism, as another poster in TFA pointed out. Even within a single loop iteration, the "unoptimized" code is more likely to exploit multiple ALUs if they exist, whether or not vector instructions are used.
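
To make the ILP point concrete, here is a rough sketch in C. It is not the code from TFA; the function names `sum_chained` and `sum_split` and the array-sum workload are made up purely for illustration. The first loop has one long dependency chain, while the second keeps four independent additions per iteration that a superscalar core could, in principle, issue to separate ALUs without any SIMD instructions being involved.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example: one accumulator. Every add needs the result of
   the previous add, so the adds form a single serial dependency chain. */
uint64_t sum_chained(const uint32_t *a, size_t n) {
    uint64_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Hypothetical example: four accumulators. The four adds inside one
   iteration do not depend on each other, so a core with several ALUs
   may be able to execute them in the same cycle. */
uint64_t sum_split(const uint32_t *a, size_t n) {
    uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++) /* leftover elements when n is not a multiple of 4 */
        s0 += a[i];
    return s0 + s1 + s2 + s3;
}
```

Both functions return the same sum for any input. Whether the second one is actually faster depends on the CPU and on what the compiler does with each loop, so treat it as an illustration of independent operations rather than a benchmark.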