|
|
|
|
|
by spi
1144 days ago
|
|
The "standard" machine for these things has 8x80GB = 640GB memory (p4de instances here: https://aws.amazon.com/ec2/instance-types/p4/), with _very_ fast connections between GPUs. This fits even a large model comfortably. Nowadays probably most training use half precision ("bf16", not exactly float16, but still 2 bytes per parameter). However during training you easily get a 10-20x factor between the number of parameters and the bytes of memory needed, due to additional things you have to store in memory (activations, gradients, etc.). So in practice the largest models (70-175B parameters) can't be trained even on one of these beefy machines. And even if you could, it would be awfully slow. In practice, they typically use servers with clusters of these machines, up to about 1000 GPUs in total (so around 80TB of memory, give or take a few?). This allows even the biggest models to be trained on large batches of several hundreds, or even thousands, of elements (the total memory usage is _not_ proportional to the product of number of parameters and the batch size, but it does increase as a function of both of them, a term of which being indeed the product of the two). It makes for some very tricky engineering choices to make just the right data travel across connections, trying to avoid as much as possible that you have to sync large amount of data between different machines (so "chunking" things to stay on the 640GB range) with strategies such as ZeRO being published every now and then. Plus of course the practical effort to make physical connections as fast as possible... To get an idea of how hard these things are, take a look at how long the list of names in the published paper about BLOOM language model is :-) |
|
[1] - https://instances.vantage.sh/