by mmoskal 633 days ago
You can do tensor parallelism 8 ways (8 KV heads). You can also do pipeline parallelism (there are 126 layers). Either way would work. A million tokens is possible, just very slow.

Also, 405B has 8 KV heads of size 128 (hidden_size / num_attention_heads), times 126 layers [0], times 2 (K and V), times 2 bytes (bf16), which is 504 KiB of KV cache per token. At FP8 it's 252 KiB.
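
For anyone who wants to check the arithmetic, here is a rough sketch in Python. The constants are the ones quoted above; the function names are made up for illustration, and the per-GPU figure assumes the KV heads split evenly across tensor-parallel ranks with nothing else counted (no weights, no activations).

```python
# Numbers quoted in the comment for the 405B model.
NUM_KV_HEADS = 8
HEAD_DIM = 128      # hidden_size / num_attention_heads
NUM_LAYERS = 126


def kv_bytes_per_token(bytes_per_elem: int) -> int:
    """KV cache bytes for one token; the factor 2 covers K and V."""
    return NUM_KV_HEADS * HEAD_DIM * NUM_LAYERS * 2 * bytes_per_elem


def kv_bytes_per_gpu(context_tokens: int, bytes_per_elem: int, tp_degree: int) -> int:
    """KV cache bytes per GPU, assuming KV heads shard evenly across ranks."""
    if NUM_KV_HEADS % tp_degree != 0:
        raise ValueError("tp_degree must divide the number of KV heads")
    return kv_bytes_per_token(bytes_per_elem) * context_tokens // tp_degree


if __name__ == "__main__":
    print(kv_bytes_per_token(2) / 1024)  # bf16, KiB per token
    print(kv_bytes_per_token(1) / 1024)  # FP8, KiB per token
    print(kv_bytes_per_gpu(1_000_000, 2, 8) / 2**30)  # GiB per GPU at 1M tokens
```

The first two prints give 504.0 and 252.0. Under the even-split assumption, a million tokens in bf16 across 8 tensor-parallel GPUs comes to roughly 60 GiB of KV cache per GPU.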

[0] https://huggingface.co/meta-llama/Meta-Llama-3.1-405B/blob/m...