Hacker News
by andersa 655 days ago
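Usually you want to split each layer across GPUs with tensor parallelism, which works best when each KV head can be assigned to one specific GPU. That means the GPU count should evenly divide the number of KV heads. All currently popular models have a power-of-2 number of KV heads, so power-of-2 GPU counts split cleanly.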
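To make the divisibility point concrete, here is a rough sketch of the head-to-GPU assignment. The function name `assign_kv_heads` and the contiguous-block layout are my own assumptions for illustration, not how any particular inference framework does it, and it ignores the option of replicating KV heads across GPUs.

```python
def assign_kv_heads(num_kv_heads: int, num_gpus: int) -> list[list[int]]:
    """Give each GPU an equal, contiguous block of KV head indices.

    Hypothetical helper for illustration only: real frameworks may
    replicate heads or use a different layout.
    """
    if num_gpus <= 0 or num_kv_heads % num_gpus != 0:
        raise ValueError(
            f"{num_kv_heads} KV heads cannot be split evenly over {num_gpus} GPUs"
        )
    per_gpu = num_kv_heads // num_gpus
    return [list(range(g * per_gpu, (g + 1) * per_gpu)) for g in range(num_gpus)]


if __name__ == "__main__":
    # Example: a model with 8 KV heads on 1 to 4 GPUs.
    for gpus in (1, 2, 3, 4):
        try:
            print(gpus, assign_kv_heads(8, gpus))
        except ValueError as err:
            print(gpus, "uneven:", err)
```

With 8 KV heads, 2 or 4 GPUs each get whole heads, while 3 GPUs do not, which is why odd GPU counts tend to be awkward for tensor parallelism.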
1 comment

interesting, thank you for the pointers.