by jbellis 26 days ago
BTW the paper says:

> Since only (Qdiff,Kdiff,Vdiff) are updated during training, the total number of trainable parameters is approximately 16% of the full model.

But the code defines four projections (q_proj_diff, k_proj_diff, v_proj_diff, and o_proj_diff), and the trainable share only matches 16% when you include the O term.
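
For anyone who wants to run the same kind of check, here is a rough sketch of the arithmetic. The projection names come from the code mentioned above, but the parameter counts and the `everything_else` bucket are made-up toy values, not the paper's model, so the percentages it prints are illustrative only. The point is just that leaving out the O projection changes the trainable fraction.

```python
# Toy parameter table. The *_diff names are the ones from the code discussed
# above; every size is a hypothetical value chosen for easy arithmetic.
TOY_PARAMS = {
    "q_proj_diff": 100,
    "k_proj_diff": 100,
    "v_proj_diff": 100,
    "o_proj_diff": 100,
    "everything_else": 1600,  # hypothetical stand-in for all frozen weights
}


def trainable_fraction(param_counts, trainable_names):
    """Return the share of all parameters whose name is in trainable_names."""
    total = sum(param_counts.values())
    trainable = sum(
        count for name, count in param_counts.items() if name in trainable_names
    )
    return trainable / total


QKV = {"q_proj_diff", "k_proj_diff", "v_proj_diff"}
QKVO = QKV | {"o_proj_diff"}

qkv_share = trainable_fraction(TOY_PARAMS, QKV)
qkvo_share = trainable_fraction(TOY_PARAMS, QKVO)

print(f"Q, K, V only:   {qkv_share:.1%}")
print(f"Q, K, V, and O: {qkvo_share:.1%}")
```

With a real checkpoint you would fill the table from the model's actual parameter names and sizes instead of these toy numbers.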