Hacker News
by mirekrusin 1068 days ago
The claims are parallelism for training, which is not a fixed speedup; a different complexity for inference (constant time); and a different complexity for large-context inference (linear). So nothing there can be summarised as "8x", or am I getting this summary wrong?
1 comment

I believe the 8x is the words per second, from the first graph in the paper.