by JackRumford
990 days ago
"With FLASHATTENTION (Dao et al., 2022), there is negligible GPU memory overhead as we increase the sequence length and we observe around 17% speed loss when increasing the sequence length from 4,096 to 16,384 for the 70B model." "For the 7B/13B models, we use learning rate 2e−5 and a cosine learning rate schedule with 2000 warm-up steps. For the larger 34B/70B models, we find it important to set a smaller learning rate (1e−5) to get monotonically decreasing validation losses." "In the training curriculum ablation study, models trained with a fixed context window of 32k from scratch required 3.783 × 10^22 FLOPs and achieved performance metrics like 18.5 F1 on NarrativeQA, 28.6 F1 on Qasper, and 37.9 EM on Quality." "Continual pretraining from short context models can easily save around 40% FLOPs while imposing almost no loss on performance." "Through early experiments at the 7B scale, we identified a key limitation of LLAMA 2’s positional encoding (PE) that prevents the attention module from aggregating information of distant tokens. We adopt a minimal yet necessary modification on the RoPE positional encoding (Su et al., 2022) for long-context modeling – decreasing the rotation angle." Pretty exciting stuff. Getting close to GPT-4 hopefully soon! |