| HN Mirror



	by riedel 1180 days ago
	I read that it only does the scoring (inference) collaboratively, though I guess it allows some fine-tuning. Parallelizing the actual gradient descent is harder, because with data/batch parallelism the gradients have to be averaged across all workers at every step. That makes it more of a network-speed problem than a GPU-speed problem. Or are LLMs somehow different?
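
To make the gradient-averaging point concrete, here is a minimal sketch in plain Python. It is not how any particular system does it. The toy one-parameter model, the function names, and the numbers in the usage note (7B parameters, fp16, 100 Mbit/s links, 8 workers) are all illustrative assumptions. It shows that averaging per-worker gradients over equal-sized shards reproduces the full-batch gradient, and gives a rough estimate of what a ring all-reduce of those gradients costs in time.

```python
def local_gradient(w, shard):
    """Gradient of mean squared error for the toy model y ~ w * x on one shard."""
    n = len(shard)
    return sum(2.0 * (w * x - y) * x for x, y in shard) / n


def averaged_gradient(w, shards):
    """Data parallelism: every worker computes a local gradient, then all are averaged.

    Assumes equal-sized shards, so the plain mean equals the full-batch gradient.
    """
    grads = [local_gradient(w, shard) for shard in shards]
    return sum(grads) / len(grads)


def allreduce_seconds(num_params, bytes_per_param, bandwidth_bytes_per_s, workers):
    """Rough ring all-reduce time per step.

    Each worker sends (and receives) about 2 * (k - 1) / k times the gradient
    size. Latency and compute overlap are ignored (simplifying assumption).
    """
    payload = num_params * bytes_per_param
    return 2.0 * (workers - 1) / workers * payload / bandwidth_bytes_per_s


# Toy batch of (x, y) pairs, split evenly across two hypothetical workers.
batch = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
shards = [batch[:2], batch[2:]]
w = 0.5

full = local_gradient(w, batch)      # single-machine gradient over the whole batch
avg = averaged_gradient(w, shards)   # data-parallel gradient after averaging
print(full, avg)

# Assumed scenario: 7e9 parameters, 2 bytes each, 100 Mbit/s = 12.5e6 bytes/s, 8 workers.
print(allreduce_seconds(7_000_000_000, 2, 12.5e6, 8))
```

Under those assumed numbers, the gradient exchange alone comes out to roughly half an hour per step, regardless of how fast the GPUs are. That is the network-bound behaviour the comment is describing.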