by rcxdude
521 days ago
This would require a revolution in the algorithms used to train neural nets. Currently, LLM training is at best distributed among GPUs in racks in the same datacenter, ideally nearby racks, and even that is a significant engineering challenge: each step depends on the result of the step before it, and each step updates all of the weights, so it's hard to parallelise. You can do it to some extent, e.g. by training on one part of the dataset on one part of the cluster and another part elsewhere, but this doesn't scale linearly (i.e. you need more compute overall to get the model to converge to something useful), and you still need a lot of bandwidth between your nodes to synchronize the copies of the network frequently. All of this makes it very poorly suited to a collection of heterogeneous compute connected via the internet, which works best with a large pool of mostly independent tasks that have a high compute cost but relatively low bandwidth requirements.
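To put rough numbers on the bandwidth point, here is a back-of-envelope sketch. Everything in it is an assumption chosen for illustration, not a measurement: a 7-billion-parameter model, 2-byte (fp16) gradients, 8 nodes, a ring all-reduce of the full gradient on every step, and link speeds of 100 Mbit/s (home internet) versus 400 Gbit/s (datacenter interconnect). The function names are made up for this example. It also ignores latency, gradient compression, and overlapping communication with compute, so treat the output as an order-of-magnitude estimate only.

```python
def allreduce_bytes_per_node(num_params, bytes_per_param, num_nodes):
    """Bytes each node sends in one ring all-reduce of the full gradient.

    A ring all-reduce moves 2 * (N - 1) / N times the payload per node.
    """
    payload = num_params * bytes_per_param
    return 2 * (num_nodes - 1) / num_nodes * payload


def sync_seconds(num_params, bytes_per_param, num_nodes, link_bits_per_s):
    """Time to synchronize gradients once, counting raw bandwidth only."""
    sent = allreduce_bytes_per_node(num_params, bytes_per_param, num_nodes)
    return sent * 8 / link_bits_per_s


if __name__ == "__main__":
    # All values below are illustrative assumptions.
    params = 7e9          # model size in parameters
    grad_bytes = 2        # fp16 gradients
    nodes = 8

    home = sync_seconds(params, grad_bytes, nodes, 100e6)        # 100 Mbit/s
    datacenter = sync_seconds(params, grad_bytes, nodes, 400e9)  # 400 Gbit/s

    print(f"home link:       {home:.1f} s per sync")
    print(f"datacenter link: {datacenter:.2f} s per sync")
    print(f"slowdown:        {home / datacenter:.0f}x")
```

Under these assumptions a single synchronization takes about half an hour over the home link and about half a second in the datacenter, and that cost is paid on every step unless you synchronize less often, which is exactly the trade-off against convergence described above.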