by Tostino 472 days ago
I couldn't quickly find it by searching your GitHub, but which layers did you end up targeting for training? It would be interesting to see an ablation on targeting different sets of layers (train only the attention layers, freeze the first 30% of the layers and train the remaining 70%, etc.).
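For concreteness, here is a rough sketch of how those ablation settings could be expressed as a choice over parameter names. It is not taken from the project's code: the `layers.<index>.` naming scheme, the `attn` marker, and the `select_trainable` helper are all assumptions made up for illustration.

```python
import re

# Assumption: parameters are named like "layers.<index>.<submodule>...".
# Real models use different naming schemes; adjust the pattern to match.
_LAYER_RE = re.compile(r"layers\.(\d+)\.")


def select_trainable(param_names, num_layers, freeze_fraction=0.0,
                     attention_only=False, attn_marker="attn"):
    """Return the parameter names that stay trainable in one ablation setting.

    freeze_fraction: fraction of the earliest layers to freeze (0.3 = first 30%).
    attention_only:  if True, keep only names containing attn_marker.
    """
    first_trainable = int(num_layers * freeze_fraction)
    selected = []
    for name in param_names:
        match = _LAYER_RE.search(name)
        if match is None:
            # Embeddings, final norm, output head: left frozen in this sketch.
            continue
        if int(match.group(1)) < first_trainable:
            continue
        if attention_only and attn_marker not in name:
            continue
        selected.append(name)
    return selected


# Toy 10-layer model with one attention and one MLP weight per layer.
names = ["embed.weight"]
for i in range(10):
    names.append(f"layers.{i}.attn.q_proj.weight")
    names.append(f"layers.{i}.mlp.up_proj.weight")

last_70_percent = select_trainable(names, num_layers=10, freeze_fraction=0.3)
attention_only = select_trainable(names, num_layers=10, attention_only=True)
```

With PyTorch, a selection like this could be applied by looping over `model.named_parameters()` and calling `p.requires_grad_(name in selected)` on each parameter, again assuming the names follow the pattern above.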

We trained all the parameters. Those would definitely be interesting ablations. I would also like to see how much of a performance hit we would take with parameter-efficient fine-tuning (PEFT) methods like LoRA.
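As a back-of-the-envelope illustration of what LoRA changes on the parameter-count side (this says nothing about the performance hit, which is the open question): for a single weight matrix of shape d_out x d_in, a rank-r LoRA adapter trains two small matrices with r * (d_in + d_out) parameters in total. The 4096 x 4096 shape and rank 8 below are illustrative assumptions, not values from this project.

```python
def full_params(d_in, d_out):
    """Trainable parameters when the whole d_out x d_in matrix is updated."""
    return d_in * d_out


def lora_params(d_in, d_out, rank):
    """Trainable parameters for a rank-`rank` LoRA adapter on the same matrix.

    The adapter is two matrices: A (rank x d_in) and B (d_out x rank).
    """
    return rank * (d_in + d_out)


# Illustrative sizes only (assumed, not taken from the thread).
d_in, d_out, rank = 4096, 4096, 8
full = full_params(d_in, d_out)        # 16_777_216
lora = lora_params(d_in, d_out, rank)  # 65_536
fraction = lora / full                 # 0.00390625, i.e. about 0.4%
```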