I would use SGD when I can tolerate an inexact estimator. For problems where I need an exact MLE (because, for example, I need to invoke asymptotic normality to get CIs), I can usually do the exact computation using less memory than IRLS would require.

In particular, what I had in mind is that most black-box optimization tools (e.g. L-BFGS) only need an implementation of a function and its gradient -- both of which can be accumulated incrementally, one observation at a time, using O(P) memory and O(N * P) time per iteration for any data store that can stream all N observations. Given code that can accumulate function values and gradients, an optimization method like L-BFGS can estimate parameters for many kinds of models. In my experience, there's often no need to materialize the kinds of dense matrices that are typically used in R's default approach to GLM fitting.
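
To make that concrete, here's a rough sketch of the accumulation for logistic regression, using only the Python standard library. The names `stream_rows` and `neg_loglik_and_grad` and the toy data are made up for illustration; a real data store would replace the generator. The point is that one pass keeps only a scalar and a length-P gradient in memory.

```python
import math

def stream_rows():
    """Hypothetical data source: yields (features, label) one row at a time.

    A real version would read from a file, database cursor, etc.
    The first feature is a constant 1.0 acting as the intercept.
    """
    data = [
        ([1.0, 0.5], 1),
        ([1.0, -1.2], 0),
        ([1.0, 2.0], 1),
        ([1.0, -0.3], 0),
        ([1.0, 0.1], 1),
        ([1.0, 0.4], 0),
    ]
    for x, y in data:
        yield x, y

def neg_loglik_and_grad(beta, rows):
    """Negative log-likelihood and gradient for logistic regression.

    Makes a single pass over `rows`; the only state kept is one float
    and a list of length P, so memory is O(P) and time is O(N * P).
    """
    value = 0.0
    grad = [0.0] * len(beta)
    for x, y in rows:
        eta = sum(b * xi for b, xi in zip(beta, x))
        # log(1 + exp(eta)), written to avoid overflow for large |eta|
        value += max(eta, 0.0) + math.log1p(math.exp(-abs(eta))) - y * eta
        # Fitted probability, also computed without overflow
        if eta >= 0:
            mu = 1.0 / (1.0 + math.exp(-eta))
        else:
            e = math.exp(eta)
            mu = e / (1.0 + e)
        for j, xi in enumerate(x):
            grad[j] += (mu - y) * xi
    return value, grad
```

Assuming SciPy is available, something like `scipy.optimize.minimize(lambda b: neg_loglik_and_grad(b, stream_rows()), x0, jac=True, method="L-BFGS-B")` should be able to drive this, possibly after converting the gradient to a NumPy array. Each function evaluation re-streams the data, so the cost is one full pass per evaluation.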