by microtonal
2016 days ago
There's nothing preventing you from sharing weights across layers, and it would be interesting to see more research on that. E.g. the ALBERT model does this (https://arxiv.org/abs/1909.11942). I have distilled XLM-RoBERTa into ALBERT-based models with multiple layer groups, and for the tasks I was working on (syntax) it works really well. For instance, we went from a fine-tuned ~1000MiB XLM-R base model to a 74MiB ALBERT-based model with barely any loss in accuracy.
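To make the "layer groups" idea concrete, here is a rough sketch of how sharing changes the number of distinct layer weights. It is not the setup described above: `LayerParams`, `build_stack`, and `unique_param_count` are hypothetical names, the sizes (hidden 768, feed-forward 3072, 12 layers) are assumed BERT-base-like values, and embeddings are ignored entirely. Each layer in the stack points at the weights of its group, so layers in the same group are literally the same object.

```python
from dataclasses import dataclass


@dataclass
class LayerParams:
    """Stand-in for one transformer layer's weights (hypothetical; tracks sizes only)."""
    hidden: int
    ffn: int

    def count(self) -> int:
        # Self-attention: Q, K, V and output projections, each hidden x hidden plus a bias.
        attention = 4 * (self.hidden * self.hidden + self.hidden)
        # Feed-forward: hidden -> ffn -> hidden, each with a bias.
        feed_forward = (self.hidden * self.ffn + self.ffn) + (self.ffn * self.hidden + self.hidden)
        # Two layer norms, each with a scale and a bias vector.
        layer_norms = 2 * 2 * self.hidden
        return attention + feed_forward + layer_norms


def build_stack(num_layers: int, num_groups: int, hidden: int, ffn: int) -> list:
    """Return num_layers entries where consecutive layers share one group's weights."""
    if num_groups < 1 or num_layers % num_groups != 0:
        raise ValueError("num_layers must be a positive multiple of num_groups")
    groups = [LayerParams(hidden, ffn) for _ in range(num_groups)]
    layers_per_group = num_layers // num_groups
    # Layer i reuses the weights of group i // layers_per_group.
    return [groups[i // layers_per_group] for i in range(num_layers)]


def unique_param_count(stack: list) -> int:
    """Count parameters once per distinct weight object, not once per layer."""
    unique = {id(layer): layer for layer in stack}
    return sum(layer.count() for layer in unique.values())


if __name__ == "__main__":
    for groups in (12, 2, 1):  # 12 groups means no sharing at all
        stack = build_stack(num_layers=12, num_groups=groups, hidden=768, ffn=3072)
        print(groups, "group(s):", unique_param_count(stack), "layer parameters")
```

With one group every layer reuses the same weights (the fully shared case); more groups trade size back for capacity. The depth of the forward pass stays the same either way, so this saves memory, not compute.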