Does anyone know exactly what config is needed for training? Presumably the $300 is spent on some GPU-heavy instance on EC2. Is it $300 of EC2 p4de.24xlarge, which is $40.96 an hour? Or maybe 7-8 nodes for an hour? Something else?
> The training was done with PyTorch FSDP on 8 A100 GPUs in one day.
> We employ SkyPilot managed spot to reduce the cost by leveraging the cheaper spot instances with auto-recovery for preemptions and auto zone switch. This solution slashes costs for training the 7B model from $500 to around $140 and the 13B model from around $1K to $300.
So this is using, for example, an a2-ultragpu-8g (8x A100-80GB) on GCP with spot instances. SkyPilot quickly shows that the price is $12.8 per hour (~$307 for a day):
» sky launch --gpus A100-80GB:8 --use-spot
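To make the arithmetic explicit, here is a rough sketch that uses only the hourly prices quoted in this thread. It assumes exactly 24 hours of wall-clock time on a single 8-GPU node and ignores preemption overhead, storage, and egress, so treat the numbers as ballpark figures. The function and constant names are made up for illustration.

```python
def run_cost(hourly_usd: float, hours: float = 24.0) -> float:
    """Total cost in USD of holding one instance for `hours` (rounded to cents)."""
    return round(hourly_usd * hours, 2)


# Hourly prices as quoted in the thread (assumed, not fetched from any API).
SPOT_A2_ULTRAGPU_8G = 12.8   # GCP a2-ultragpu-8g, spot
P4DE_24XLARGE = 40.96        # EC2 p4de.24xlarge, rate from the question

spot_day = run_cost(SPOT_A2_ULTRAGPU_8G)
p4de_day = run_cost(P4DE_24XLARGE)

print(f"a2-ultragpu-8g spot, 24h: ${spot_day:.2f}")
print(f"p4de.24xlarge, 24h:       ${p4de_day:.2f}")
```

The spot figure lands near the quoted $300 for the 13B model, and a full day at the p4de.24xlarge rate comes out close to the "around $1K" figure, so the blog's numbers look consistent with one 8x A100 node running for about a day.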
Check out detailed CLI instructions and SkyPilot YAMLs here if you want to give it a try:
- https://github.com/lm-sys/FastChat#vicuna
- https://github.com/lm-sys/FastChat/blob/main/scripts/train-v...