by whymauri
781 days ago
From https://huggingface.co/gradientai/Llama-3-8B-Instruct-Gradie...

> We build on top of the EasyContext Blockwise RingAttention library [3] to scalably and efficiently train on contexts up to 1048k tokens on Crusoe Energy high performance L40S cluster.

> Notably, we layered parallelism on top of Ring Attention with a custom network topology to better leverage large GPU clusters in the face of network bottlenecks from passing many KV blocks between devices. This gave us a 33x speedup in model training (compare 524k and 1048k to 65k and 262k in the table below).

My understanding is you need multiple GPUs to coordinate ring-attention for the long context window.
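
To make the "passing many KV blocks between devices" part concrete, here is a rough single-process NumPy sketch of the general ring attention idea. It is not Gradient's or EasyContext's code. The function names (`full_attention`, `ring_attention_sim`) are made up for illustration, the "devices" are just list entries, and it assumes non-causal attention with a sequence length divisible by the device count. Each simulated device keeps its own query block, attends to whichever K/V block it currently holds, then hands that block to its neighbour, using a streaming softmax so no device ever sees the full sequence at once.

```python
import numpy as np

def full_attention(q, k, v):
    """Reference: ordinary softmax attention over the whole sequence."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return (weights / weights.sum(axis=-1, keepdims=True)) @ v

def ring_attention_sim(q, k, v, n_devices):
    """Simulate non-causal ring attention in one process (illustrative only)."""
    d = q.shape[-1]
    # Assumption: seq_len is divisible by n_devices, so blocks are equal-sized.
    q_blocks = np.split(q, n_devices)
    kv_blocks = list(zip(np.split(k, n_devices), np.split(v, n_devices)))

    # Per-device running state for a numerically stable streaming softmax.
    row_max = [np.full((qb.shape[0], 1), -np.inf) for qb in q_blocks]
    denom = [np.zeros((qb.shape[0], 1)) for qb in q_blocks]
    acc = [np.zeros((qb.shape[0], v.shape[-1])) for qb in q_blocks]

    for _ in range(n_devices):
        for i, qb in enumerate(q_blocks):
            kb, vb = kv_blocks[i]  # the K/V block device i holds this step
            scores = qb @ kb.T / np.sqrt(d)
            new_max = np.maximum(row_max[i], scores.max(axis=-1, keepdims=True))
            scale = np.exp(row_max[i] - new_max)  # rescale earlier partial sums
            p = np.exp(scores - new_max)
            denom[i] = denom[i] * scale + p.sum(axis=-1, keepdims=True)
            acc[i] = acc[i] * scale + p @ vb
            row_max[i] = new_max
        # Ring step: device i receives the block device i-1 was holding.
        kv_blocks = kv_blocks[-1:] + kv_blocks[:-1]

    return np.concatenate([a / s for a, s in zip(acc, denom)])

# Small check that the blockwise result matches plain attention.
rng = np.random.default_rng(0)
q, k, v = rng.standard_normal((3, 16, 8))
print(np.allclose(ring_attention_sim(q, k, v, n_devices=4), full_attention(q, k, v)))
```

On real hardware the rotation at the end of each step is a send/receive between GPUs, ideally overlapped with the block computation, which is where the network bottleneck mentioned in the quote would come from.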