There are different ways to run LLMs on multiple GPUs, one of them (called tensor parallelism) in low batch scenarios would be multiplying bandwidth between different GPUs. So no, 8 4090s is not 1000 GB/s.
I’m developing inference engine, so I actually do understand how it works. As well as other types of parallelism and how exactly they do different trade offs