by riedel
1180 days ago
I read that it only scores the model collaboratively, though I guess it allows some fine-tuning. Parallelizing the actual gradient descent is harder: with data/batch parallelism, the gradients have to be averaged across workers, so it becomes more of a network-speed problem than a GPU-speed problem. Or are LLMs somehow different?
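To make the averaging point concrete, here is a rough pure-Python sketch of one synchronous data-parallel step on a toy linear model. It is not taken from the project being discussed. The two "workers", the toy data, and the helper names (`local_gradient`, `all_reduce_mean`, `ring_allreduce_bytes`) are all made up for illustration, and the traffic estimate assumes a ring all-reduce with 4-byte parameters.

```python
# Toy sketch of synchronous data-parallel SGD for y ≈ w*x + b.
# All names and numbers here are illustrative assumptions.

def local_gradient(params, shard):
    """Mean-squared-error gradient of (w, b) on one worker's data shard."""
    w, b = params
    n = len(shard)
    gw = sum(2 * (w * x + b - y) * x for x, y in shard) / n
    gb = sum(2 * (w * x + b - y) for x, y in shard) / n
    return [gw, gb]

def all_reduce_mean(grads):
    """Stand-in for the network all-reduce: element-wise mean over workers."""
    k = len(grads)
    return [sum(g[i] for g in grads) / k for i in range(len(grads[0]))]

def ring_allreduce_bytes(num_params, num_workers, bytes_per_param=4):
    """Approximate bytes each worker sends per step in a ring all-reduce."""
    return 2 * (num_workers - 1) / num_workers * num_params * bytes_per_param

# Eight samples of y = 3x + 1, split evenly across two hypothetical workers.
data = [(float(x), 3.0 * x + 1.0) for x in range(8)]
shards = [data[0:4], data[4:8]]
params = [0.0, 0.0]

# Each worker computes a gradient on its own shard, then they are averaged.
averaged = all_reduce_mean([local_gradient(params, s) for s in shards])

# With equal-sized shards this matches the gradient over the whole batch.
full_batch = local_gradient(params, data)

# Per-step traffic for a hypothetical 1B-parameter model on 8 workers.
traffic = ring_allreduce_bytes(1_000_000_000, 8)
```

The compute per worker shrinks as workers are added, but the all-reduce traffic per step stays roughly proportional to the parameter count, which is why the link speed ends up dominating on slow networks.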