by YetAnotherNick 146 days ago
It depends on whether you are using tensor parallelism or pipeline parallelism. With pipeline parallelism you don't need any sharing.
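
To make the distinction concrete, here is a rough toy sketch in plain Python, with no real GPUs or frameworks involved. It assumes "sharing" means combining partial results across devices inside a single layer, which is my reading rather than something stated above. The names `matvec`, `tensor_parallel_layer` and `pipeline_parallel` are made up for illustration, and the "devices" are just loop iterations.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def tensor_parallel_layer(W, x, n_devices=2):
    """One layer split across devices: each holds a column slice of W.

    Every device produces only a partial result, so the partials must be
    summed across devices (the all-reduce step) before the layer is done.
    Assumes len(x) is divisible by n_devices.
    """
    chunk = len(x) // n_devices
    partials = []
    for d in range(n_devices):
        lo, hi = d * chunk, (d + 1) * chunk
        W_shard = [row[lo:hi] for row in W]  # this device's slice of the weights
        partials.append(matvec(W_shard, x[lo:hi]))
    # Cross-device combine: stands in for an all-reduce.
    return [sum(vals) for vals in zip(*partials)]


def pipeline_parallel(layers, x):
    """Each layer lives whole on its own device.

    No weights or partial sums are combined; only the activation is
    handed from one stage to the next.
    """
    activation = x
    for W in layers:
        activation = matvec(W, activation)
    return activation


W1 = [[1, 2], [3, 4]]
W2 = [[0, 1], [1, 1]]
x = [1, 1]

reference = matvec(W2, matvec(W1, x))
tp_out = tensor_parallel_layer(W2, tensor_parallel_layer(W1, x))
pp_out = pipeline_parallel([W1, W2], x)
```

All three paths give the same output. The difference is that the tensor-parallel version needs a combine step inside every layer, while the pipeline version only forwards activations between stages.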