by ryao 567 days ago
Inference workloads likely won’t care very much. For Llama 3.1 405B in bf16, when you split the workload across GPUs by layer, the only thing one GPU has to hand to the next is the hidden state: 16,384 values at 2 bytes each, so a 32KB memory copy per token before the next GPU can begin processing. That can be done incredibly quickly over PCI-E.
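To sanity-check the 32KB figure, here is a rough back-of-envelope sketch. The hidden size of 16,384 is Llama 3.1 405B's model dimension; the 32 GB/s link speed is my assumption (roughly the theoretical peak of a PCIe 4.0 x16 link in one direction), and the function names are made up for illustration. It ignores per-transfer latency and protocol overhead, so treat the timing as a lower bound, not a measurement.

```python
# Back-of-envelope size of the per-token handoff between pipeline stages.
HIDDEN_SIZE = 16_384      # Llama 3.1 405B model dimension
BYTES_PER_BF16 = 2        # bf16 is a 16-bit format

# Assumption: ~32 GB/s, about the theoretical one-way peak of PCIe 4.0 x16.
# Real links deliver less, and each transfer adds fixed latency on top.
ASSUMED_PCIE_BYTES_PER_SEC = 32e9


def handoff_bytes(tokens: int = 1) -> int:
    """Bytes of hidden state passed to the next GPU for `tokens` tokens."""
    return tokens * HIDDEN_SIZE * BYTES_PER_BF16


def transfer_microseconds(num_bytes: int,
                          bytes_per_sec: float = ASSUMED_PCIE_BYTES_PER_SEC) -> float:
    """Bandwidth-only transfer time in microseconds (no latency term)."""
    return num_bytes / bytes_per_sec * 1e6


if __name__ == "__main__":
    size = handoff_bytes()  # 32768 bytes, i.e. 32 KiB
    print(f"{size} bytes per token")
    print(f"~{transfer_microseconds(size):.2f} us at the assumed link speed")
```

Under those assumptions the copy is on the order of a microsecond per token, which is small next to the time each GPU spends reading its share of the weights.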