As I read it, the claim has three parts: parallelism for training, which is not a fixed speed-up; a different complexity for inference (constant time); and a different complexity for large-context inference (linear). So nothing there can be summarised as "8x". Or am I getting this summary wrong?
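
To show what I mean, here is a toy cost model. It is only a sketch under my own assumptions, not numbers from any benchmark: I assume the baseline's per-token inference cost grows in proportion to context length, while the other model's per-token cost is a constant. The function name `speedup` and the value `state_cost=64` are hypothetical, picked purely for illustration.

```python
def speedup(context_len: int, state_cost: int = 64) -> float:
    """Ratio of hypothetical per-token inference costs.

    Assumption (mine, not from the source): the baseline's per-token cost is
    proportional to context length, and the other model's per-token cost is
    the constant `state_cost` (an arbitrary illustrative value).
    """
    baseline_cost = context_len  # grows with context length
    constant_cost = state_cost   # independent of context length
    return baseline_cost / constant_cost


if __name__ == "__main__":
    # The ratio changes with every context length, so no single figure describes it.
    for n in (512, 4096, 32768):
        print(f"context={n:>6}  speedup={speedup(n):.1f}x")
```

With these made-up constants the ratio happens to be 8x at a context of 512 and 512x at 32768, which is the point: when the complexity classes differ, any single multiplier only holds at one context length.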