
by kir-gadjello 1195 days ago

"Accomodate" is the word to scrutinize here. Yes, it will cost a lot to outright buy physical HPC infrastructure to train and infer a series of large models deployed for customers all over the globe. No, it won't cost nearly as much to rent cloud infra to train a similarly-sized model. No, you won't be able to train a large model on a single multi-GPU node, you will need a cluster containing a respectable power of two of GPUs (or other accelerators).

It's a widely known meme at this point, but to reiterate: for a popular large model, most of the cost goes to inference, not training. If we assume inference runs on the end user's device, this cost disappears for the model provider.
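One rough way to see why inference can dominate: a minimal sketch, assuming the common rule-of-thumb estimates of about 6 FLOPs per parameter per training token and about 2 FLOPs per parameter per generated token for a dense transformer. Neither figure comes from this thread, the 20B-parameter / 400B-token model below is hypothetical, and compute is only a proxy for cost.

```python
# Assumed rule-of-thumb constants for a dense transformer (not from the thread).
TRAIN_FLOPS_PER_PARAM_TOKEN = 6   # forward + backward pass
INFER_FLOPS_PER_PARAM_TOKEN = 2   # forward pass only


def training_flops(n_params: int, n_train_tokens: int) -> int:
    """Approximate total compute spent on one training run."""
    return TRAIN_FLOPS_PER_PARAM_TOKEN * n_params * n_train_tokens


def inference_flops(n_params: int, n_served_tokens: int) -> int:
    """Approximate total compute spent serving n_served_tokens."""
    return INFER_FLOPS_PER_PARAM_TOKEN * n_params * n_served_tokens


def break_even_served_tokens(n_train_tokens: int) -> int:
    """Served tokens at which inference compute equals training compute.

    Solving 2 * N * T = 6 * N * D for T gives T = 3 * D, independent of N.
    """
    return (TRAIN_FLOPS_PER_PARAM_TOKEN * n_train_tokens) // INFER_FLOPS_PER_PARAM_TOKEN


if __name__ == "__main__":
    n_params = 20 * 10**9        # hypothetical model size
    n_train_tokens = 400 * 10**9  # hypothetical training set size
    t = break_even_served_tokens(n_train_tokens)
    print(f"training compute:  {training_flops(n_params, n_train_tokens):.2e} FLOPs")
    print(f"break-even serving: {t:.2e} tokens")
```

Under these assumptions, once a deployed model has served about three times as many tokens as it was trained on, inference compute has overtaken training compute. A popular service keeps serving past that point, while the training bill is paid once.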

And even if you have the million to rent a cluster, there is a very deep question of the optimal architecture, dataset and hyperparameters to train the best model possible under given constraints.


bitL 1195 days ago

You are missing the point. Extremely large LLMs don't train the same way as your BERT_Large x8 variety of LLMs. Your whole training procedure is different. Also, Microsoft spent so much initially because their Azure Cloud was unable to cope with it electrically, and they had to rewire a datacenter for it. So it's not even a question of just renting 1000 GPUs. Do you have actual experience training GPT-3+ sized models?

kir-gadjello 1195 days ago

If you are interested in the infrastructure-level details of how similar models are trained by lesser known groups, take a look at this paper: https://arxiv.org/abs/2204.06745

Quotes from the paper:

Our model is trained using a codebase that builds on Megatron (Shoeybi et al., 2020) and DeepSpeed (Rasley et al., 2020) to facilitate efficient and straightforward training of large language models with tens of billions of parameters. We use the official PyTorch v1.10.0 release binary package compiled with CUDA 11.1. This package is bundled with NCCL 2.10.3 for distributed communications.

We trained GPT-NeoX-20B on twelve Supermicro AS-4124GO-NART servers, each with eight NVIDIA A100-SXM4-40GB GPUs and configured with two AMD EPYC 7532 CPUs. All GPUs can directly access the InfiniBand switched fabric through one of four ConnectX-6 HCAs for GPUDirect RDMA. Two NVIDIA MQM8700-HS2R switches—connected by 16 links—compose the spine of this InfiniBand network, with one link per node CPU socket connected to each switch.
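The hardware numbers in that quote can be turned into a quick back-of-the-envelope check. This is a minimal sketch, assuming a round 20 billion parameters (taken from the model name, not an exact count) and the commonly cited estimate of 16 bytes of model state per parameter for mixed-precision Adam training; the paper's actual optimizer and sharding setup may differ.

```python
# Figures from the quoted hardware description.
NODES = 12
GPUS_PER_NODE = 8
GPU_MEM_GB = 40  # A100-SXM4-40GB

# Assumption: mixed-precision Adam keeps, per parameter,
# fp16 weights (2 B) + fp16 gradients (2 B) + fp32 master weights,
# momentum and variance (4 B each, 12 B total).
BYTES_PER_PARAM = 2 + 2 + 12


def total_gpus() -> int:
    return NODES * GPUS_PER_NODE


def node_memory_gb() -> int:
    return GPUS_PER_NODE * GPU_MEM_GB


def cluster_memory_gb() -> int:
    return total_gpus() * GPU_MEM_GB


def model_state_gb(n_params: int) -> float:
    """Weights, gradients and optimizer state only; activations are excluded."""
    return n_params * BYTES_PER_PARAM / 1e9


if __name__ == "__main__":
    n_params = 20_000_000_000  # assumption: rounded from "20B"
    print("GPUs in cluster:     ", total_gpus())
    print("memory per node (GB):", node_memory_gb())
    print("cluster memory (GB): ", cluster_memory_gb())
    print("model state (GB):    ", model_state_gb(n_params))
```

Under these assumptions the model state alone comes to about as much memory as one whole 8-GPU node has, before counting any activations, which is consistent with the claim that a single multi-GPU node is not enough for a model of this size.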

And if you are interested in 176B-scale training, read the BLOOM-176B and OPT-175B papers and research logs.