I mean faster runtime performance; I have no clue about compilation time.
Well, I'm basing this on the countless benchmarks I've seen, e.g. on Phoronix over the past decade.
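Also, you have to understand that comparing at -O2 is (was) "unfair" in Clang's favor, as Clang autovectorizes at -O2 while GCC did not enable autovectorization until -O3. This has changed: since GCC 12, -O2 also enables vectorization, though with a very cheap cost model.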
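
If you want to see the -O2 vs -O3 difference yourself, here is a rough sketch of a test file. The `saxpy` kernel and the file name `saxpy.c` are my own made-up example, not taken from any benchmark, and what the compilers report will depend on version and target.

```c
#include <stddef.h>

/* Hypothetical example kernel (not from any benchmark): y[i] += a * x[i].
 * `restrict` promises the two arrays do not overlap, which makes this loop
 * an easy candidate for autovectorization. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}

/* Possible ways to check whether the loop gets vectorized
 * (the file name saxpy.c is an assumption):
 *
 *   gcc   -O2 -c -fopt-info-vec saxpy.c
 *   gcc   -O3 -c -fopt-info-vec saxpy.c
 *   gcc   -O2 -ftree-vectorize -c -fopt-info-vec saxpy.c
 *   clang -O2 -c -Rpass=loop-vectorize saxpy.c
 */
```

On an older GCC I'd expect the plain -O2 build to report nothing and the other three to report a vectorized loop, but check it on your own setup rather than taking my word for it.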