


	by neilmovva 929 days ago
	Promising results, excited to try it out! Question on the perf benchmarks: why do all the results with 2 GPUs and DDP take longer than the single-GPU case? Both benchmarks do the same amount of work (one training epoch), so this negative scaling is surprising.


danielhanchen 929 days ago

So there's 2 main reasons:

1. DDP itself has overhead, because it has to synchronize gradients at each training step: GPU0 and GPU1 have to exchange gradients.

2. Huggingface seems not to be well optimized for DDP, mainly due to inefficient data movement. We fixed that, and interestingly it's faster even on 1 GPU.


neilmovva 929 days ago

I agree that synchronization causes overhead, so 2x GPUs won't achieve the ideal 0.5x total runtime. But here, taking your Alpaca benchmark as an example, we are seeing 2x GPUs get 3.6x runtime with Huggingface, or 1.15x with Unsloth Max.

In other words, every benchmark, in either HF or Unsloth, is slower in absolute terms when going from 1 to 2 GPUs. That makes me think something is wrong with the test.

Could you share your benchmark code?
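To make the scaling arithmetic concrete, here is a minimal sketch that turns the runtime ratios quoted above (0.5x ideal, 3.6x Huggingface, 1.15x Unsloth Max) into speedup and parallel efficiency. The function name `scaling_summary` is made up for illustration; only the three ratios come from the thread.

```python
def scaling_summary(runtime_ratio: float, num_gpus: int) -> dict:
    """Summarize scaling from a runtime ratio.

    runtime_ratio is (multi-GPU runtime) / (single-GPU runtime) for the
    same total work, e.g. one training epoch.
    """
    speedup = 1.0 / runtime_ratio      # > 1 means the multi-GPU run is faster
    efficiency = speedup / num_gpus    # 1.0 means perfect linear scaling
    return {"speedup": speedup, "efficiency": efficiency}


# Ratios quoted in the thread for going from 1 to 2 GPUs.
for name, ratio in [("ideal", 0.5), ("Huggingface", 3.6), ("Unsloth Max", 1.15)]:
    s = scaling_summary(ratio, num_gpus=2)
    print(f"{name}: speedup {s['speedup']:.2f}x, efficiency {s['efficiency']:.0%}")
```

A speedup below 1.0 means the 2-GPU run is slower in absolute terms than the 1-GPU run, which is the case for both quoted measurements.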


danielhanchen 929 days ago

You can refer to QLoRA's official finetuning notebook: https://colab.research.google.com/drive/1VoYNfYDKcKRQRor98Zb... Obviously I can't provide the code we have, but if you use the same datasets and the same settings (bsz = 2, ga = 4, max_grad_norm = 0.3, num_epochs = 1, seed = 3407, max_seq_len = 2048) you should be able to replicate it.
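For anyone trying to replicate, the settings above can be written down as a plain config. This is only a sketch: it assumes `bsz` is a per-GPU batch size (the thread does not say), and the dataset size of 10,000 is a made-up placeholder, not the real benchmark size. Under that assumption, going from 1 to 2 GPUs doubles the global batch and halves the optimizer steps per epoch.

```python
import math

# Settings as quoted in the comment above.
settings = {
    "bsz": 2,             # assumed to be per-GPU; not confirmed in the thread
    "ga": 4,              # gradient accumulation steps
    "max_grad_norm": 0.3,
    "num_epochs": 1,
    "seed": 3407,
    "max_seq_len": 2048,
}


def optimizer_steps_per_epoch(num_examples: int, bsz: int, ga: int, num_gpus: int) -> int:
    """Optimizer steps in one epoch when each GPU sees bsz examples per micro-step."""
    global_batch = bsz * ga * num_gpus
    return math.ceil(num_examples / global_batch)


num_examples = 10_000  # hypothetical placeholder, not the real dataset size
for gpus in (1, 2):
    steps = optimizer_steps_per_epoch(num_examples, settings["bsz"], settings["ga"], gpus)
    print(f"{gpus} GPU(s): global batch {settings['bsz'] * settings['ga'] * gpus}, {steps} steps/epoch")
```

If the assumption holds, the 2-GPU run performs half as many optimizer steps for the same epoch, so a fair comparison should look at total wall-clock time for the epoch rather than time per step.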
