> As to the motivation for the correct step: can you point me to a resource that explains this? Not sure I follow...
You write an equation involving division by the gradient. This is an illegal operation (one cannot divide by a vector), and your final recipe doesn't do it. As far as I can tell, you are writing down the incorrect, illegally-vector-inverting formula as motivation for the correct formula involving the (inverse of the) Hessian. All I am suggesting is that you say explicitly something like "Of course, this formula as written is not literally correct; one cannot actually divide by a vector. The correct procedure is explained below."
(Incidentally, speaking of inverses, another poster (https://news.ycombinator.com/item?id=14881265) has mentioned that it may be a bit confusing to speak of the inverse of a matrix rather than the reciprocal, since (as I interpret that other poster's point) the reciprocal of a matrix is just its inverse. I might prefer to say something like "We write $H_\ell(\theta)^{-1}\nabla\ell(\theta)$ rather than $\frac{\nabla\ell(\theta)}{H_\ell(\theta)}$ to emphasise that we are inverting a matrix, not a scalar, so that the order of multiplication matters.")
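To make the point concrete, here is a minimal sketch of one Newton step in which nothing is ever "divided" by a vector or a matrix: the step $d$ comes from solving the linear system $H d = \nabla\ell$. The objective $\ell(x, y) = x^2 + xy + 2y^2 - 3x$, the helper names, and the minimisation update $\theta \leftarrow \theta - H^{-1}\nabla\ell$ are my own assumptions for illustration, not something taken from the article.

```python
# Hypothetical objective (an assumption for illustration):
#   l(x, y) = x^2 + x*y + 2*y^2 - 3*x

def gradient(theta):
    """Gradient of l at theta = (x, y)."""
    x, y = theta
    return (2 * x + y - 3, x + 4 * y)


def hessian(theta):
    """Hessian of l; constant here because l is quadratic."""
    return ((2.0, 1.0), (1.0, 4.0))


def solve_2x2(a, b):
    """Solve a @ d = b for d by Cramer's rule (a is 2x2, b has length 2)."""
    (a11, a12), (a21, a22) = a
    det = a11 * a22 - a12 * a21
    if det == 0:
        raise ValueError("Hessian is singular; the Newton step is undefined")
    return ((b[0] * a22 - a12 * b[1]) / det,
            (a11 * b[1] - a21 * b[0]) / det)


def newton_step(theta):
    """One Newton update: theta - H^{-1} grad, computed via a linear solve."""
    d = solve_2x2(hessian(theta), gradient(theta))
    return (theta[0] - d[0], theta[1] - d[1])


theta1 = newton_step((0.0, 0.0))
print(theta1, gradient(theta1))
```

Because this particular objective is quadratic, a single step from the origin should land on the stationary point $(12/7, -3/7)$, where the gradient vanishes up to rounding. For anything larger than a toy problem one would hand the solve to a linear algebra library rather than use Cramer's rule.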
Bishop has a nice treatment of Newton's method in "Pattern Recognition and Machine Learning". It's a good book to have on your shelf if you are learning this stuff.