Hacker News new | ask | show | jobs
by GistNoesis 1144 days ago
That's why it's not parallelized along the time axis but rather along the dimension of the embedding axis.

You split the big matrices into smaller matrices to dispatch the workload. But this means you have to add some communication overhead (roughly nblayers sequential synchronisation point per token). In official LLama implementation this is done transparently using RowParallelLinear, ColumnParallelLinear, ParallelEmbedding see https://github.com/facebookresearch/llama/blob/main/llama/mo...

Transformer have multiple attention heads, that can be computed independently and then summed together to produce the output of the layer. This allow to split the parameter space among machines without having to transfer them at each iteration.