Because in a 2x CPU (dual-socket) system, the model may have to be read across NUMA nodes, and that path has only 10% - 30% of the memory bandwidth.
https://news.ycombinator.com/item?id=42868360
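To get a rough feel for what that means, here is a back-of-the-envelope sketch in Python. It assumes a simplified model that is not from the linked comment: a fraction of the model's bytes sit on the other socket, local and remote reads do not overlap, and the remote path runs at a fixed ratio of local bandwidth. The function name and the 100 GB/s figure are hypothetical, chosen only for illustration.

```python
def effective_bandwidth(local_gbps: float, remote_ratio: float, remote_fraction: float) -> float:
    """Estimate blended memory bandwidth (GB/s) under a simple NUMA model.

    local_gbps      -- bandwidth to memory attached to the local socket (assumed value)
    remote_ratio    -- remote bandwidth as a fraction of local (0.10 to 0.30 per the comment)
    remote_fraction -- share of bytes that must be read from the other socket (assumption)
    """
    if local_gbps <= 0 or not 0 < remote_ratio <= 1 or not 0 <= remote_fraction <= 1:
        raise ValueError("inputs out of range")
    remote_gbps = local_gbps * remote_ratio
    # Time to read 1 GB: the local part and the remote part are read back to back.
    seconds_per_gb = (1 - remote_fraction) / local_gbps + remote_fraction / remote_gbps
    return 1 / seconds_per_gb


if __name__ == "__main__":
    # Hypothetical 100 GB/s local bandwidth, half of the model on the other socket.
    for ratio in (0.10, 0.20, 0.30):
        bw = effective_bandwidth(100.0, ratio, 0.5)
        print(f"remote ratio {ratio:.2f}: ~{bw:.1f} GB/s effective")
```

Under these assumptions the slow remote reads dominate the total time, so the blended figure lands much closer to the remote bandwidth than to the average of the two.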