Are there any papers or blogs about how these GPUs are attached to the host? I find it interesting that you can get a VM with 96 vCPUs, which I assume amounts to a whole box (2x24-core hyperthreaded Xeon CPUs?), but with either 8 or 16 GPUs. How does the 8-GPU option avoid stranding the other 8 GPUs? Is there some kind of rack-wide PCIe switch that can attach GPUs to various hosts, or something else?
We sadly don’t talk about how we rack these at all, but the folks at Facebook have made their OCP designs public for vaguely similar systems.
However, I’ll note that the 16 A100s here are way more expensive than the CPU cores (and we can just run vanilla VMs on those leftover cores if really needed).
Worse than that, "A"/"a" on AWS can actually mean AMD or ARM. An "a1.large" is an instance with a first-gen Graviton ARM CPU (whereas instances with the second-gen Graviton2 ARM CPUs are named something like "c6g.large"). A "c5a.large" is an instance with an x86 AMD CPU.
Even worse, on GCE, A originally stood for AMD (and N for iNtel). In any case, this A is for Accelerator.