by covi
1180 days ago
From the post:

> The training was done with PyTorch FSDP on 8 A100 GPUs in one day.

> We employ SkyPilot managed spot to reduce the cost by leveraging the cheaper spot instances with auto-recovery for preemptions and auto zone switch. This solution slashes costs for training the 7B model from $500 to around $140 and the 13B model from around $1K to $300.

So this is using, for example, a2-ultragpu-8g (8x A100-80GB) on GCP with spot instances. You can use SkyPilot to quickly see that the price is $12.8 per hour (~$307 for a day):

  » sky launch --gpus A100-80GB:8 --use-spot

Check out detailed CLI instructions and SkyPilot YAMLs here if you want to give it a try:

- https://github.com/lm-sys/FastChat#vicuna
- https://github.com/lm-sys/FastChat/blob/main/scripts/train-v...
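For anyone who wants to check the arithmetic, here is a minimal sketch that reproduces the numbers quoted above. It only uses the figures from this comment and the post ($12.8 per hour, one day of training, and the before/after costs); the function names are made up for illustration and nothing here queries live pricing.

```python
# Figures are the ones quoted in the comment; they are not fetched from any API.
SPOT_PRICE_PER_HOUR = 12.8  # USD per hour for 8x A100-80GB on spot, as quoted
TRAINING_HOURS = 24         # "in one day"


def run_cost(price_per_hour, hours):
    """Total cost of keeping the instance up for the given number of hours."""
    return price_per_hour * hours


def savings_fraction(before, after):
    """Fraction of the original cost that was saved."""
    return (before - after) / before


daily_cost = run_cost(SPOT_PRICE_PER_HOUR, TRAINING_HOURS)
print(f"one day at the quoted spot price: ${daily_cost:.2f}")

# Before/after training costs quoted in the post (USD).
for name, before, after in [("7B", 500, 140), ("13B", 1000, 300)]:
    print(f"{name}: {savings_fraction(before, after):.0%} cheaper")
```

One day at the quoted spot price comes to about $307, which is in the same ballpark as the roughly $300 quoted for the 13B model, and the quoted reductions work out to roughly 70% for both model sizes.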