by Tostino
472 days ago
I couldn't quickly find it by searching your GitHub, but which layers did you end up targeting for training? It would be interesting to see an ablation over different sets of layers (train only the attention layers, freeze the first 30% of the layers and train the remaining 70%, etc.).
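To make the ablation idea concrete, here is a rough sketch of how the trainable parameter set could be picked by name. Everything in it is an assumption for illustration: the `layers.<i>.attn` / `layers.<i>.mlp` naming scheme, the `select_trainable` helper, and the choice to leave non-layer parameters (embeddings, final norm) frozen. It is not taken from the project being discussed.

```python
import re

# Hypothetical naming scheme: "layers.<index>.<block>.<param>"
LAYER_RE = re.compile(r"layers\.(\d+)\.")


def select_trainable(param_names, num_layers, freeze_fraction=0.0, attention_only=False):
    """Return the parameter names that would be trained under one ablation setting."""
    # Number of leading layers to freeze, rounded down.
    n_frozen = int(num_layers * freeze_fraction)
    trainable = []
    for name in param_names:
        match = LAYER_RE.search(name)
        if match is None:
            continue  # assumption: embeddings, final norm, etc. stay frozen
        if int(match.group(1)) < n_frozen:
            continue  # inside the frozen prefix of the stack
        if attention_only and ".attn." not in name:
            continue  # skip MLP and other non-attention blocks
        trainable.append(name)
    return trainable


# Toy 10-layer model with one attention and one MLP weight per layer.
names = [f"layers.{i}.{block}.weight" for i in range(10) for block in ("attn", "mlp")]

freeze_first_30 = select_trainable(names, num_layers=10, freeze_fraction=0.3)
attention_only = select_trainable(names, num_layers=10, attention_only=True)
```

In a real framework the returned names would then decide which parameters receive gradients; each ablation run only changes `freeze_fraction` and `attention_only`.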