by sliken 1173 days ago
Does anyone know exactly what config is needed for training? Presumably the $300 is spent on some GPU-heavy instance on EC2. Is it $300 of an EC2 p4de.24xlarge, which is $40.96 an hour? Or maybe 7-8 nodes for an hour? Something else?

From the post:

> The training was done with PyTorch FSDP on 8 A100 GPUs in one day.

> We employ SkyPilot managed spot to reduce the cost by leveraging the cheaper spot instances with auto-recovery for preemptions and auto zone switch. This solution slashes costs for training the 7B model from $500 to around $140 and the 13B model from around $1K to $300.
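
As a quick sanity check on the quoted figures, the relative savings can be worked out with a few lines of Python. This is only a sketch: the dollar amounts are the rounded numbers from the quote above, and the helper name `savings_pct` is made up for illustration.

```python
def savings_pct(before: float, after: float) -> float:
    """Percentage saved when the cost drops from `before` to `after`."""
    return 100 * (before - after) / before

# Rounded figures from the quoted post: (cost before, cost with managed spot), in USD.
quoted = {"7B": (500, 140), "13B": (1000, 300)}

savings = {model: round(savings_pct(before, after))
           for model, (before, after) in quoted.items()}
print(savings)  # roughly 70% cheaper in both cases
```

So by the post's own numbers, spot instances cut the bill by about 70% for both model sizes.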

So this is using, for example, an a2-ultragpu-8g (8x A100-80GB) spot instance on GCP. You can use SkyPilot to quickly check that the price is $12.8 per hour (~$307 for a day):

» sky launch --gpus A100-80GB:8 --use-spot
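
The arithmetic behind those numbers, as a rough sketch. The hourly prices are the ones quoted in this thread ($12.8 for the GCP spot instance, $40.96 for the EC2 p4de.24xlarge from the question) and will drift over time; the variable names are just illustrative.

```python
HOURS_PER_DAY = 24

spot_hourly = 12.8    # USD/hr, a2-ultragpu-8g spot price quoted above
p4de_hourly = 40.96   # USD/hr, EC2 p4de.24xlarge price from the question
budget = 300          # USD, the 13B training cost from the post

# One day of training on the spot instance.
spot_day_cost = round(spot_hourly * HOURS_PER_DAY, 2)

# How many hours the same budget buys at the p4de price.
p4de_hours = round(budget / p4de_hourly, 2)

print(f"spot, one day: ${spot_day_cost}")
print(f"p4de hours for ${budget}: {p4de_hours}")
```

In other words, $300 covers about a full day on the spot instance but only a bit over 7 hours at the p4de price, which is why the one-day figure points to spot pricing.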

Check out detailed CLI instructions and SkyPilot YAMLs here if you want to give it a try:

- https://github.com/lm-sys/FastChat#vicuna

- https://github.com/lm-sys/FastChat/blob/main/scripts/train-v...

What’s wrong with vast.ai? You’d probably be looking at $2/hr, so about $50 for the whole fine-tuning.
Their mention of SkyPilot isn't an accident; it seems to be a "find me cheap spot instances" project: https://github.com/skypilot-org/skypilot#readme and, as best I can tell, that's what the YAML files in their repo are for: https://github.com/lm-sys/FastChat/blob/main/scripts/train-v...