by bitL
1195 days ago
You are missing the point. Extremely large LLMs don't train the same way as your BERT_Large x8 variety of LLMs; the whole training procedure is different. Also, Microsoft spent so much initially because their Azure Cloud was unable to cope with it electrically and they had to rewire a datacenter for it. So it's not even a question of just renting 1000 GPUs. Do you have actual experience training GPT-3+ sized models?
Quotes from the paper:

Our model is trained using a codebase that builds on Megatron (Shoeybi et al., 2020) and DeepSpeed (Rasley et al., 2020) to facilitate efficient and straightforward training of large language models with tens of billions of parameters. We use the official PyTorch v1.10.0 release binary package compiled with CUDA 11.1. This package is bundled with NCCL 2.10.3 for distributed communications.
We trained GPT-NeoX-20B on twelve Supermicro AS-4124GO-NART servers, each with eight NVIDIA A100-SXM4-40GB GPUs and configured with two AMD EPYC 7532 CPUs. All GPUs can directly access the InfiniBand switched fabric through one of four ConnectX-6 HCAs for GPUDirect RDMA. Two NVIDIA MQM8700-HS2R switches—connected by 16 links—compose the spine of this InfiniBand network, with one link per node CPU socket connected to each switch.
And if you are interested in 176B-scale training, read the BLOOM-176B and OPT-175B papers and research logs.
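
For a rough sense of scale, here is a back-of-the-envelope sketch in Python using the hardware numbers quoted above (12 nodes, 8 GPUs each, 40 GB per GPU). The 16 bytes per parameter figure is my assumption, not something from the quoted text: it is the commonly cited footprint for mixed-precision Adam (fp16 weights and gradients plus fp32 master weights and optimizer states). It ignores activations, buffers and fragmentation, and treats 1 GB as 10^9 bytes, so real requirements are higher. The function names are made up for illustration.

```python
import math

# Hardware figures taken from the quoted paper text.
NODES = 12
GPUS_PER_NODE = 8
GPU_MEMORY_GB = 40

# ASSUMPTION (not from the quoted text): mixed-precision Adam keeps roughly
# 2 bytes (fp16 weights) + 2 bytes (fp16 gradients) + 12 bytes (fp32 master
# weights, momentum, variance) per parameter.
BYTES_PER_PARAM = 16


def cluster_gpus(nodes=NODES, gpus_per_node=GPUS_PER_NODE):
    """Total number of GPUs in the cluster."""
    return nodes * gpus_per_node


def model_state_gb(n_params, bytes_per_param=BYTES_PER_PARAM):
    """Approximate size of weights + gradients + optimizer states, in GB (10^9 bytes)."""
    return n_params * bytes_per_param / 1e9


def min_gpus_for_model_states(n_params, gpu_memory_gb=GPU_MEMORY_GB):
    """Lower bound on GPUs needed just to hold the model states (no activations)."""
    return math.ceil(model_state_gb(n_params) / gpu_memory_gb)


if __name__ == "__main__":
    total = cluster_gpus()
    print(f"cluster: {total} GPUs, {total * GPU_MEMORY_GB} GB aggregate GPU memory")
    for name, n in [("20B", 20_000_000_000), ("176B", 176_000_000_000)]:
        print(f"{name}: ~{model_state_gb(n):.0f} GB of model states, "
              f"at least {min_gpus_for_model_states(n)} x {GPU_MEMORY_GB} GB GPUs")
```

Even under these optimistic assumptions, the model states of a 20B-parameter model do not fit on a single 40 GB GPU, which is why the Megatron/DeepSpeed-style sharding and the interconnect matter as much as the GPU count.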