Hacker News
by
andersa
655 days ago
Usually you want to split each layer to run with tensor parallelism, which works best if you can assign each KV head to a specific GPU. All currently popular models have a power-of-2 number of KV heads.
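A minimal sketch of the head-to-GPU assignment described above, not taken from any particular framework. The function names and the example value of 8 KV heads are assumptions for illustration only. It shows that heads split evenly only when the GPU count divides the head count.

```python
def assign_kv_heads(num_kv_heads: int, num_gpus: int) -> list[list[int]]:
    """Return, for each GPU, the list of KV head indices it would own.

    Hypothetical helper: this sketch only handles the even-split case and
    raises otherwise (including when there are more GPUs than KV heads).
    """
    if num_gpus <= 0 or num_kv_heads % num_gpus != 0:
        raise ValueError(
            f"{num_kv_heads} KV heads cannot be split evenly across {num_gpus} GPUs"
        )
    per_gpu = num_kv_heads // num_gpus
    # GPU g owns the contiguous block of heads [g * per_gpu, (g + 1) * per_gpu).
    return [list(range(g * per_gpu, (g + 1) * per_gpu)) for g in range(num_gpus)]


def even_gpu_counts(num_kv_heads: int, max_gpus: int) -> list[int]:
    """List the GPU counts up to max_gpus that divide the KV heads evenly."""
    return [n for n in range(1, max_gpus + 1) if num_kv_heads % n == 0]


if __name__ == "__main__":
    # Assumed example: a model with 8 KV heads.
    print(assign_kv_heads(8, 4))
    print(even_gpu_counts(8, 8))
```

With a power-of-2 head count, the GPU counts that split evenly are themselves powers of 2, so something like 3 or 6 GPUs would leave heads unevenly distributed in this sketch.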
1 comment
StrangeDoctor
655 days ago
interesting, thank you for the pointers.