by jbellis
26 days ago
BTW the paper says

> Since only (Qdiff,Kdiff,Vdiff) are updated during training, the total number of trainable parameters is approximately 16% of the full model.

But the code defines q_proj_diff, k_proj_diff, v_proj_diff, and o_proj_diff, and it only matches 16% when you include the O term.
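A rough sketch of how one might check this kind of claim: sum the parameter counts by name suffix, once without and once with the O projection. The helper `trainable_fraction` and the toy shapes below are hypothetical and chosen only for illustration; they are not the real model's dimensions, so the printed fractions say nothing about the actual 16% figure.

```python
from math import prod


def trainable_fraction(shapes, trainable_suffixes):
    """Return the share of parameters whose name ends with a given suffix.

    `shapes` maps parameter names to shape tuples (hypothetical layout).
    """
    suffixes = tuple(trainable_suffixes)
    total = sum(prod(shape) for shape in shapes.values())
    trainable = sum(
        prod(shape) for name, shape in shapes.items() if name.endswith(suffixes)
    )
    return trainable / total


# Toy shapes for illustration only, NOT the real model.
toy_shapes = {
    "layer0.q_proj_diff": (4, 4),
    "layer0.k_proj_diff": (4, 4),
    "layer0.v_proj_diff": (4, 4),
    "layer0.o_proj_diff": (4, 4),
    "layer0.frozen_rest": (16, 4),  # stand-in for everything that stays frozen
}

qkv_only = trainable_fraction(
    toy_shapes, ["q_proj_diff", "k_proj_diff", "v_proj_diff"]
)
qkv_and_o = trainable_fraction(
    toy_shapes, ["q_proj_diff", "k_proj_diff", "v_proj_diff", "o_proj_diff"]
)

print(f"Q/K/V only: {qkv_only:.3f}")
print(f"Q/K/V/O:    {qkv_and_o:.3f}")
```

With the real checkpoint, the same comparison could be run on the actual parameter names and shapes to see which set of projections reproduces the reported percentage.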