| > OOO and even wider RVV registers will then automatically speed things up, without even a recompile. The problem is that for several RVV features it's unclear how they will perform on high-performance OoO cores: * General choice of LMUL: on in-order cores it's clear that maximizing LMUL without spilling is the best approach; for OoO cores this isn't clear. * How will LMUL>1 vrgather and vcompress perform? * How high is the impact of vsetvli instructions? Is it worth trying to move them outside of loops whenever possible, or is the impact minimal, as in the current in-order implementations? * What is the overhead of using the .vx instruction variants? Is there an additional cost involved in moving between GPRs and vector registers? * Is there additional overhead when reinterpreting vector masks? * What performance can we expect from the more complex loads/stores, especially the segmented ones? The LLVM scheduling models give some insight: * SiFive P670: https://github.com/llvm/llvm-project/blob/main/llvm/lib/Targ... * Tenstorrent Ascalon: https://github.com/llvm/llvm-project/blob/main/llvm/lib/Targ... (still missing the vector part, but there is supposed to be a PR in the near future) I'm trying to collect as much info on hardware as I can: https://camel-cdr.github.io/rvv-bench-results/index.html |
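
For anyone not familiar with why wider registers or a bigger LMUL help "without even a recompile", here is a rough Python model of a strip-mined RVV loop. It is only a sketch: `vlmax`, `vsetvli` and `vector_add` are names I made up for illustration, it only handles integer LMUL, and it assumes vsetvli always returns min(AVL, VLMAX), which is a simplification of what the spec allows. It counts loop iterations only and says nothing about how long each iteration takes on a real core, which is exactly the open question for OoO designs.

```python
def vlmax(vlen_bits, sew_bits, lmul):
    """Elements per register group: VLEN * LMUL / SEW (integer LMUL only)."""
    return vlen_bits * lmul // sew_bits


def vsetvli(avl, vlen_bits, sew_bits, lmul):
    """Simplified vsetvli model: assume vl = min(AVL, VLMAX)."""
    return min(avl, vlmax(vlen_bits, sew_bits, lmul))


def vector_add(a, b, vlen_bits, sew_bits=32, lmul=1):
    """Strip-mined element-wise add; returns (result, loop iterations)."""
    assert len(a) == len(b)
    out = []
    iterations = 0
    i = 0
    n = len(a)
    while i < n:
        # One vsetvli per iteration, as in the usual RVV loop shape.
        vl = vsetvli(n - i, vlen_bits, sew_bits, lmul)
        # Stands in for vle32.v / vadd.vv / vse32.v on vl elements.
        out.extend(x + y for x, y in zip(a[i:i + vl], b[i:i + vl]))
        i += vl
        iterations += 1
    return out, iterations


if __name__ == "__main__":
    data = list(range(100))
    for vlen, lmul in [(128, 1), (256, 1), (128, 8)]:
        _, iters = vector_add(data, data, vlen, lmul=lmul)
        print(f"VLEN={vlen} LMUL={lmul}: {iters} iterations")
```

The same loop runs fewer iterations when VLEN or LMUL goes up, with no change to the code. Whether fewer, wider iterations are actually faster on an OoO core depends on how the hardware splits up LMUL>1 operations, and the model doesn't capture that.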