by JackRumford 1037 days ago

"With FlashAttention (Dao et al., 2022), there is negligible GPU memory overhead as we increase the sequence length and we observe around 17% speed loss when increasing the sequence length from 4,096 to 16,384 for the 70B model."

"For the 7B/13B models, we use learning rate 2e−5 and a cosine learning rate schedule with 2000 warm-up steps. For the larger 34B/70B models, we find it important to set a smaller learning rate (1e−5) to get monotonically decreasing validation losses."
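For anyone who hasn't written one, a cosine schedule with warm-up like the one in that quote can be sketched in a few lines. The peak rate (2e-5, or 1e-5 for the larger models) and the 2000 warm-up steps come from the quote; the linear warm-up shape, `total_steps`, and the `min_lr` floor are assumptions, since the quote doesn't give them.

```python
import math

def cosine_lr(step, peak_lr=2e-5, warmup_steps=2000,
              total_steps=100_000, min_lr=0.0):
    """Learning rate at `step` (0-indexed).

    total_steps and min_lr are hypothetical values chosen for illustration.
    """
    if step < warmup_steps:
        # Linear ramp (assumed shape) that reaches peak_lr on the last warm-up step.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    # Half-cosine decay from peak_lr down to min_lr.
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

if __name__ == "__main__":
    for s in (0, 1_000, 2_000, 50_000, 100_000):
        print(s, cosine_lr(s))
```

Passing `peak_lr=1e-5` gives the setting quoted for the 34B/70B models; everything else stays the same.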

"In the training curriculum ablation study, models trained with a fixed context window of 32k from scratch required 3.783 × 10^22 FLOPs and achieved performance metrics like 18.5 F1 on NarrativeQA, 28.6 F1 on Qasper, and 37.9 EM on Quality."

"Continual pretraining from short context models can easily save around 40% FLOPs while imposing almost no loss on performance."

"Through early experiments at the 7B scale, we identified a key limitation of Llama 2’s positional encoding (PE) that prevents the attention module from aggregating information of distant tokens. We adopt a minimal yet necessary modification on the RoPE positional encoding (Su et al., 2022) for long-context modeling – decreasing the rotation angle."
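To make "decreasing the rotation angle" concrete: in RoPE, the dimension pair with index i at position m is rotated by m · base^(−2i/d), so raising the base shrinks the angle for every pair except i = 0. Below is a minimal sketch of that relationship. The base values (10,000 as the commonly used default, 500,000 as a larger one) and `head_dim=128` are assumptions for illustration, not numbers taken from the quote.

```python
import math

def rope_angles(position, head_dim, base):
    """Rotation angle (radians) for each dimension pair at `position`."""
    return [position * base ** (-2 * i / head_dim) for i in range(head_dim // 2)]

def rotate_pair(x0, x1, angle):
    """Apply the 2D rotation RoPE uses to one (x0, x1) pair."""
    c, s = math.cos(angle), math.sin(angle)
    return x0 * c - x1 * s, x0 * s + x1 * c

if __name__ == "__main__":
    # Assumed, illustrative settings (not from the quoted text).
    default = rope_angles(position=16_384, head_dim=128, base=10_000.0)
    larger = rope_angles(position=16_384, head_dim=128, base=500_000.0)
    # Pair 0 is unchanged; the other pairs rotate less with the larger base.
    for i in (0, 16, 32, 63):
        print(i, default[i], larger[i])
```

Smaller angles mean the relative rotation between two far-apart tokens grows more slowly with distance, which is the property the quote is after.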

Pretty exciting stuff. Hopefully we get close to GPT-4 soon!