Hacker News
by dumitrue 4744 days ago
It's possible that the underlying model is just not particularly good at learning from data. 11B is a lot of free parameters to learn. For instance, the main competitor to that paradigm is the work by Krizhevsky et al., which uses convolutional networks with lots of parameter sharing, and I think they get better performance (on a comparable task) with ~60M free parameters.
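To make the parameter-sharing point concrete, here is a rough sketch that counts the free parameters in a single layer two ways: once as a convolution (one filter bank reused at every position) and once with untied weights (a separate filter bank at each position). The layer sizes and function names are made up for illustration and are not taken from either model.

```python
def conv_params(in_ch, out_ch, k):
    """Free parameters in a conv layer: one k x k filter bank shared
    across all spatial positions, plus one bias per output channel."""
    return out_ch * (in_ch * k * k + 1)


def untied_params(in_ch, out_ch, k, out_h, out_w):
    """Free parameters when every output position gets its own
    filter bank and biases (no weight sharing)."""
    return out_h * out_w * conv_params(in_ch, out_ch, k)


# Hypothetical layer sizes, chosen only for illustration.
in_ch, out_ch, k = 3, 96, 11   # RGB input, 96 filters, 11x11 receptive field
out_h = out_w = 55             # assumed output feature-map size

shared = conv_params(in_ch, out_ch, k)
untied = untied_params(in_ch, out_ch, k, out_h, out_w)

print(f"shared weights: {shared:,} parameters")
print(f"untied weights: {untied:,} parameters")
print(f"ratio:          {untied // shared}x")
```

With these assumed sizes, the shared layer has 34,944 parameters and the untied one about 105.7M, so the gap is exactly the number of output positions (55 x 55 = 3025). That is only one layer under made-up dimensions, but it shows how the totals can differ by orders of magnitude for the same receptive fields.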