by mmoskal
633 days ago
You can do tensor parallelism 8 ways (there are 8 KV heads). You can also do pipeline parallelism (there are 126 layers). Either way would work. A million tokens is possible, just very slow. As for the KV cache: 405B has 8 KV heads of size 128 (hidden_size / num_attention_heads), times 126 layers [0], times 2 (K and V), times 2 bytes (bf16), which comes to 516,096 bytes, i.e. 504 KiB per token. At FP8 it's 252 KiB.

[0] https://huggingface.co/meta-llama/Meta-Llama-3.1-405B/blob/m...
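For anyone who wants to check the per-token arithmetic above, here is a rough sketch in Python. The function name and the one-million-token figure at the end are just for illustration; the model numbers (8 KV heads, head size 128, 126 layers) are the ones quoted in the comment, and this counts only the raw K and V tensors, not any allocator or framework overhead.

```python
def kv_cache_bytes_per_token(num_kv_heads: int, head_dim: int,
                             num_layers: int, bytes_per_elem: int) -> int:
    """Raw KV cache size for one token: K and V for every KV head in every layer."""
    k_and_v = 2  # one key vector and one value vector per head
    return num_kv_heads * head_dim * num_layers * k_and_v * bytes_per_elem


# Figures quoted above for Llama 3.1 405B.
bf16_bytes = kv_cache_bytes_per_token(8, 128, 126, bytes_per_elem=2)
fp8_bytes = kv_cache_bytes_per_token(8, 128, 126, bytes_per_elem=1)

print(f"bf16: {bf16_bytes} bytes = {bf16_bytes / 1024} KiB per token")
print(f"fp8:  {fp8_bytes} bytes = {fp8_bytes / 1024} KiB per token")

# Hypothetical 1M-token context, cache only (weights and activations excluded).
million_tokens_bytes = bf16_bytes * 1_000_000
print(f"1M tokens at bf16: {million_tokens_bytes / 1e9:.1f} GB")
```

So a single million-token sequence at bf16 needs on the order of half a terabyte of cache before counting the weights, which is why it has to be spread across GPUs either by KV head (tensor parallel) or by layer (pipeline parallel).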