by littlestymaar
603 days ago
> removing it was probably more of a hack to make things parallelizable

But that's the entire point of it. Transformer-based LLMs are “more intelligent” only because this parallelization lets you make them bigger and train them on bigger datasets.