by YetAnotherNick 637 days ago
> The blog seems to indicate it is using LoRA. So we should remove the backward param pass from the equation above. Backward param only applies to adaptor weights

The backward pass still runs through the non-adapter weights. But yeah, 10 TFLOPS/GPU, especially at a tiny sequence length, is very bad compared to what you can get on Nvidia. And I believe the difference would be even larger at longer sequence lengths.

1 comments

The backward pass for activations does, but typically not the backward pass for weight gradients.

Why compute gradients with respect to weights that aren't going to be updated?
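
To put rough numbers on this, here is a back-of-the-envelope sketch for a single linear layer. It assumes the usual 2 FLOPs per multiply-add and that forward, activation-gradient, and weight-gradient matmuls each cost about the same. The function names, layer sizes, and rank are made up for illustration and are not taken from the blog.

```python
def matmul_flops(tokens, d_in, d_out):
    """FLOPs for a (tokens x d_in) @ (d_in x d_out) matmul, 2 per multiply-add."""
    return 2 * tokens * d_in * d_out


def step_flops(tokens, d_in, d_out, rank=None):
    """Approximate FLOPs for one training step through one linear layer.

    rank=None: full fine-tuning (forward + both backward matmuls).
    rank=r:    LoRA with a frozen base weight and adapters of shape
               (d_in x r) and (r x d_out); illustrative assumption only.
    """
    fwd = matmul_flops(tokens, d_in, d_out)
    bwd_act = fwd      # dL/dx = dL/dy @ W^T, needed to keep backpropagating
    bwd_weight = fwd   # dL/dW = x^T @ dL/dy, only needed if W is trainable

    if rank is None:
        return fwd + bwd_act + bwd_weight

    # Base weight is frozen: the weight-gradient matmul is skipped,
    # but the activation-gradient matmul still runs.
    adapter_fwd = matmul_flops(tokens, d_in, rank) + matmul_flops(tokens, rank, d_out)
    adapter_total = 3 * adapter_fwd  # adapters are trained, so forward + both backward
    return fwd + bwd_act + adapter_total


if __name__ == "__main__":
    full = step_flops(tokens=2048, d_in=4096, d_out=4096)
    lora = step_flops(tokens=2048, d_in=4096, d_out=4096, rank=16)
    print(f"LoRA / full fine-tune FLOPs: {lora / full:.3f}")
```

Under these assumptions LoRA saves roughly a third of the per-layer compute, not two thirds: the frozen weights skip the weight-gradient matmul only, while the activation gradients still have to flow back to reach adapters in earlier layers.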