by YetAnotherNick
637 days ago
> The blog seems to indicate it is using LoRA. So we should remove the backward param pass from the equation above. Backward param only applies to adaptor weights.

The backward pass still runs on the non-adapter weights. But yeah, 10 TFLOPS/GPU, especially at a tiny sequence length, is very bad compared to what you can get on Nvidia. And I believe the difference would be even higher with a large sequence length.

Why compute gradients with respect to weights that aren't going to be updated?
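
For what it's worth, the distinction being argued over can be put as rough FLOP accounting. This is only a sketch under the usual rule of thumb of about 2 FLOPs per parameter per token for each of the forward pass, the backward pass with respect to activations, and the backward pass with respect to weights. The function name and the parameter counts are made up for illustration and are not taken from the blog. The idea is that frozen weights skip the weight-gradient term, while activation gradients still have to flow back through them to reach adapters in earlier layers.

```python
def train_flops_per_token(n_frozen, n_trainable):
    """Approximate matmul FLOPs per token for one training step.

    Assumption: ~2 FLOPs per parameter per token for each of the three
    passes. Attention-score FLOPs and other overheads are ignored.
    """
    n_total = n_frozen + n_trainable
    forward = 2 * n_total
    # Activation gradients pass through every layer, frozen or not,
    # so that adapters in earlier layers receive a gradient signal.
    backward_activations = 2 * n_total
    # Weight gradients are only needed for weights that get updated.
    backward_weights = 2 * n_trainable
    return forward + backward_activations + backward_weights


# Hypothetical sizes: a 7B base model and 20M LoRA adapter parameters.
full_finetune = train_flops_per_token(n_frozen=0, n_trainable=7_000_000_000)
lora_finetune = train_flops_per_token(n_frozen=7_000_000_000, n_trainable=20_000_000)

print(f"full fine-tune: {full_finetune:.3e} FLOPs/token")
print(f"LoRA fine-tune: {lora_finetune:.3e} FLOPs/token")
print(f"ratio: {lora_finetune / full_finetune:.2f}")
```

Under these assumptions full fine-tuning costs about 6 FLOPs per parameter per token and LoRA about 4, so LoRA saves roughly a third of the compute rather than the entire backward pass.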