For one, speed and memory. They have twice as many Q and K weights in the attention blocks, leading to a ~10% reduction in throughput on their H100 (table 7 in appendix A).