by dekhn
1607 days ago
Systems like this are designed to reach nearly peak performance (i.e., # of flops per processing element * # of processing elements), explicitly by building a network that won't block or add latency for the common expensive operations (allreduce, all-to-all), at the expense of greatly increased cost. The point of building this machine is to have a lot of A100s running at the same time, which will unblock a small set of researchers working on time-sensitive, competitive research projects by giving them a slight throughput and latency advantage on the largest problems. The vast majority of users would be better served by a small number of cheaper, slower GPUs that they had exclusive access to for the longest time period they could afford to wait.
The system certainly is. The code running on that system generally isn't. Pulling 100% of the FLOPS the GPUs are able to provide is quite hard.
And my point was that it also depends on the specific models you are training. Are you training a transformer model in FP32 precision? Then yes, 6K A100s will blow away 10K V100s. Are you training a ConvNet in FP16? Then no, 10K V100s will perform better.
The GPUs have different architectures, so you have to use the model architecture best suited to the A100 to achieve the speedup marketed by NVIDIA, which is presumably the number FB is using to claim that their 6K GPU cluster is bigger than OpenAI's 10K one.
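
To make the arithmetic in this exchange concrete, here is a rough sketch of the "flops per processing element * # of processing elements" figure, scaled by the fraction of peak the code actually reaches. The function names are made up for illustration. The per-GPU numbers are approximate FP16 tensor-core peaks as I recall them from the datasheets, and the utilization values are purely hypothetical, not measurements of any real workload.

```python
def cluster_peak_pflops(num_gpus: int, peak_tflops_per_gpu: float) -> float:
    """Aggregate peak: per-GPU peak times GPU count, converted to PFLOPS."""
    return num_gpus * peak_tflops_per_gpu / 1000.0


def achieved_pflops(num_gpus: int, peak_tflops_per_gpu: float,
                    utilization: float) -> float:
    """Peak scaled by the fraction of it the running code actually reaches."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    return cluster_peak_pflops(num_gpus, peak_tflops_per_gpu) * utilization


# ASSUMPTION: approximate dense FP16 tensor-core peaks per GPU, in TFLOPS.
A100_FP16_TFLOPS = 312.0
V100_FP16_TFLOPS = 125.0

a100_peak = cluster_peak_pflops(6_000, A100_FP16_TFLOPS)
v100_peak = cluster_peak_pflops(10_000, V100_FP16_TFLOPS)

# ASSUMPTION: hypothetical utilizations, chosen only to show that the
# ranking can flip when a model fits one GPU generation better.
a100_real = achieved_pflops(6_000, A100_FP16_TFLOPS, utilization=0.30)
v100_real = achieved_pflops(10_000, V100_FP16_TFLOPS, utilization=0.50)

# Ratio of utilizations (V100 over A100) above which the V100 cluster wins.
break_even_ratio = a100_peak / v100_peak

print(f"peak:     A100 {a100_peak:.0f} PFLOPS vs V100 {v100_peak:.0f} PFLOPS")
print(f"achieved: A100 {a100_real:.1f} PFLOPS vs V100 {v100_real:.1f} PFLOPS")
print(f"V100 cluster wins if its utilization is > {break_even_ratio:.2f}x the A100's")
```

On paper the smaller A100 cluster has the higher peak under these assumed numbers, but which cluster comes out ahead in practice depends entirely on the utilization each model and precision actually gets, which is the point being argued above.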