| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by dumitrue 4791 days ago
	It's possible that the underlying model is just not particularly good at learning from data. 11B parameters is a lot of free parameters to learn -- for instance, the main competitor to that paradigm is the work by Krizhevsky et al., which are convolutional networks with lots of parameter sharing, and I think they get better performance (on a comparable task) with ~60M free parameters.