|
|
|
|
|
by brutus1213
1144 days ago
|
|
I'm really curious how Meta, DeepMind and OpenAI make the big models work. The biggest A100 you can buy is just 80GB. And I assume the big companies use single precision floating point during training. Are they actually partitioning the big model across multiple GPU instances? If one had the hardware, how many GPUs does the biggest LLAMA take? These are systems issues and I have not read papers or blog posts on how this works. To me, this infra is very non-trivial. |
|
In practice, they typically use servers with clusters of these machines, up to about 1000 GPUs in total (so around 80TB of memory, give or take a few?). This allows even the biggest models to be trained on large batches of several hundreds, or even thousands, of elements (the total memory usage is _not_ proportional to the product of number of parameters and the batch size, but it does increase as a function of both of them, a term of which being indeed the product of the two). It makes for some very tricky engineering choices to make just the right data travel across connections, trying to avoid as much as possible that you have to sync large amount of data between different machines (so "chunking" things to stay on the 640GB range) with strategies such as ZeRO being published every now and then. Plus of course the practical effort to make physical connections as fast as possible...
To get an idea of how hard these things are, take a look at how long the list of names in the published paper about BLOOM language model is :-)