My intuition is that a larger model gives you more directions in parameter space that are orthogonal to the gradients of previous samples.
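A rough way to sanity-check that intuition is to look at how close to orthogonal two random directions are as the dimension grows. This is only a sketch under a strong assumption: random Gaussian vectors stand in for per-sample gradients, which real gradients are not, and `mean_abs_cosine` is a hypothetical helper made up for this illustration.

```python
import math
import random

def mean_abs_cosine(dim, n_pairs=200, seed=0):
    """Average |cosine similarity| between pairs of random Gaussian vectors in R^dim.

    Assumption: random vectors are a crude stand-in for per-sample gradients.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_pairs):
        u = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        total += abs(dot / (norm_u * norm_v))
    return total / n_pairs

# Values closer to 0 mean the pairs are closer to orthogonal.
for dim in (10, 100, 1000):
    print(dim, round(mean_abs_cosine(dim), 3))
```

The average should shrink as `dim` grows, which fits the intuition. It says nothing about how correlated actual gradients are in a trained model.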