by dumitrue
4744 days ago
It's possible that the underlying model is just not particularly good at learning from data. 11B is a lot of free parameters to learn. For instance, the main competitor to that paradigm is the work by Krizhevsky et al., who use convolutional networks with lots of parameter sharing, and I think they get better performance (on a comparable task) with ~60M free parameters.
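To make the parameter-sharing point concrete, here is a rough sketch comparing one convolutional layer (weights shared across positions) with the same layer when every output position gets its own untied filters. The layer sizes and the helper names `conv_params` and `locally_connected_params` are made up for illustration; they are not taken from either paper.

```python
def conv_params(in_channels, out_channels, kernel):
    # One bank of kernel x kernel filters is reused at every spatial position,
    # plus one bias per output channel.
    return out_channels * (in_channels * kernel * kernel + 1)


def locally_connected_params(in_channels, out_channels, kernel, out_h, out_w):
    # No sharing: each of the out_h * out_w output positions has its own
    # filter bank and biases.
    return out_h * out_w * conv_params(in_channels, out_channels, kernel)


# Hypothetical layer: 3 input channels, 64 filters of size 5x5, 32x32 output map.
shared = conv_params(3, 64, 5)
untied = locally_connected_params(3, 64, 5, 32, 32)

print(shared)            # 4864
print(untied)            # 4980736
print(untied // shared)  # 1024
```

With these assumed sizes, untying the weights multiplies the parameter count by the number of output positions, which is roughly how a model without sharing can reach billions of parameters while a convolutional one stays in the tens of millions.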