by zwaps
822 days ago
1. They compare against an older, fairly standard transformer implementation. I'm unsure whether the results would be equally significant against models with gated units, multi-query attention, etc.
2. The difference seems to diminish with scale. Real-life transformers are obviously much larger and train on many more tokens.
3. A very significant part of training transformer models is the throughput and memory optimizations. I wonder how their model would work with fused kernels or specialized paged KV cache schemes. Or activation checkpointing, if run locally.
4. Indeed, they claim no memory impact, but their code shows that their experiments are conducted with a special optimized version which requires all activations to reside in a single tensor at all times. Not sure this would work with 3D parallelism on multiple nodes, etc.