by nl
1157 days ago
LLaMA 65B used 2048 80GB A100s for 21 days:

> When training a 65B-parameter model, our code processes around 380 tokens/sec/GPU on 2048 A100 GPU with 80GB of RAM.[1]

Note that you probably need to budget for double to triple that, because things go wrong and it usually takes multiple starts to get a good training run.

Smaller models are cheaper though.

[1] https://arxiv.org/pdf/2302.13971.pdf
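For a rough sense of scale, here is a back-of-the-envelope sketch of the arithmetic using only the figures quoted above (2048 GPUs, 21 days, 380 tokens/sec/GPU). The 2x and 3x multipliers are just the "double to triple" rule of thumb from this comment, not a measured overhead, and the function names are made up for illustration.

```python
# Back-of-the-envelope training budget from the figures quoted above.
# All constants come from the comment / paper quote; nothing here is measured.

TOKENS_PER_SEC_PER_GPU = 380   # quoted throughput for the 65B model
NUM_GPUS = 2048                # 80GB A100s
TRAINING_DAYS = 21

SECONDS_PER_DAY = 24 * 60 * 60


def gpu_hours(num_gpus: int, days: int) -> int:
    """GPU-hours for one uninterrupted run."""
    return num_gpus * days * 24


def tokens_processed(tokens_per_sec_per_gpu: int, num_gpus: int, days: int) -> int:
    """Total tokens seen at a constant per-GPU throughput."""
    return tokens_per_sec_per_gpu * num_gpus * days * SECONDS_PER_DAY


base_gpu_hours = gpu_hours(NUM_GPUS, TRAINING_DAYS)
total_tokens = tokens_processed(TOKENS_PER_SEC_PER_GPU, NUM_GPUS, TRAINING_DAYS)

# Rule of thumb from the comment: budget 2x to 3x for failed or restarted runs.
budget_low = 2 * base_gpu_hours
budget_high = 3 * base_gpu_hours

print(f"single run:   {base_gpu_hours:,} GPU-hours")
print(f"tokens seen:  {total_tokens:,}")
print(f"budget range: {budget_low:,} to {budget_high:,} GPU-hours")
```

So one clean run is about a million A100-hours and roughly 1.4 trillion tokens, and a realistic budget by the rule of thumb above is somewhere in the 2 to 3 million GPU-hour range.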