by FuckButtons 653 days ago
AFAIK, the main bottleneck in training is memory bandwidth. Distributed GPU compute has multiple orders of magnitude less bandwidth than an equivalent number of colocated GPUs, because the GPUs don’t share a physical bus and have to talk over a network connection instead. This work improves on that, but the fundamental limitations remain.
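To put rough numbers on that gap, here is a back-of-envelope sketch in Python. Everything in it is an assumption for illustration only: the 1B-parameter model, the 8 workers, the two link speeds, and the helper name `ring_allreduce_seconds` are all made up, and the estimate ignores latency, compression, and compute/communication overlap. It only uses the standard ring all-reduce traffic volume of 2 * (N - 1) / N times the gradient size per worker.

```python
def ring_allreduce_seconds(param_count, bytes_per_param, workers, link_bytes_per_sec):
    """Bandwidth-only lower bound for one ring all-reduce of the gradients.

    Hypothetical helper: ignores latency, protocol overhead, and overlap.
    """
    payload = param_count * bytes_per_param
    # In a ring all-reduce each worker sends 2 * (N - 1) / N of the payload.
    traffic_per_worker = 2 * (workers - 1) / workers * payload
    return traffic_per_worker / link_bytes_per_sec


# Illustrative assumptions, not measurements.
PARAMS = 1_000_000_000      # hypothetical 1B-parameter model
BYTES_PER_PARAM = 2         # 16-bit gradients
WORKERS = 8
LOCAL_LINK = 400e9          # assumed intra-node interconnect, bytes/s
NETWORK_LINK = 125e6        # assumed 1 Gbit/s network link, bytes/s

local_s = ring_allreduce_seconds(PARAMS, BYTES_PER_PARAM, WORKERS, LOCAL_LINK)
network_s = ring_allreduce_seconds(PARAMS, BYTES_PER_PARAM, WORKERS, NETWORK_LINK)
ratio = network_s / local_s

print(f"colocated: {local_s:.5f} s per gradient sync")
print(f"networked: {network_s:.1f} s per gradient sync")
print(f"slowdown:  {ratio:.0f}x")
```

With these assumed link speeds the ratio is just the ratio of the bandwidths, so the model size and worker count cancel out. Swap in your own numbers to see how far a given setup is from the colocated case.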