From what I read, it only scores the model collaboratively, though I guess it allows some fine-tuning.
Getting the actual gradient descent to parallelize is more difficult, because with data/batch parallelism the workers have to average their gradients at every step. It becomes more of a network-speed problem than a GPU-speed problem. Or are LLMs somehow different?
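To make the averaging step concrete, here is a minimal sketch of one data-parallel SGD step on a toy one-parameter model. Everything in it is an assumption for illustration: the function names are made up, the "all-reduce" is just a local mean standing in for the network exchange, and the 7-billion-parameter figure is a hypothetical model size, not something from the project being discussed.

```python
# Toy data-parallel SGD step for a 1-D linear model y = w * x with MSE loss.
# All names are hypothetical; no real distributed framework is used.

def local_gradient(w, shard):
    """Mean gradient of (w*x - y)^2 with respect to w over one worker's shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def all_reduce_mean(grads):
    """Stand-in for the network all-reduce: every worker ends up with the mean."""
    return sum(grads) / len(grads)

def data_parallel_step(w, shards, lr):
    # On real hardware each shard's gradient is computed on a different GPU.
    grads = [local_gradient(w, shard) for shard in shards]
    # This is the communication step that has to happen on every update.
    avg_grad = all_reduce_mean(grads)
    return w - lr * avg_grad

def gradient_payload_gb(num_params, bytes_per_value=4):
    """Size of one full gradient in GB, assuming one value per parameter."""
    return num_params * bytes_per_value / 1e9

batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
shards = [batch[:2], batch[2:]]  # two workers, equal shard sizes

w_parallel = data_parallel_step(0.0, shards, lr=0.01)
w_single = 0.0 - 0.01 * local_gradient(0.0, batch)

# Hypothetical 7B-parameter model with 32-bit gradients.
payload_gb = gradient_payload_gb(7_000_000_000)
```

With equal shard sizes the averaged update matches the single-machine update, so the math parallelizes fine. The catch is the payload: under these assumptions a full gradient is about 28 GB, and it has to cross the network on every step, which is trivial inside a datacenter and painful over home connections.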
I wonder how they handle illegal content. If you're running training data on your computer, what's to stop someone else's illegal data from being uploaded to your machine as part of training?