I think the CMU lecture notes [0] approach it in an intuitive way: they start from the linear model with Gaussian noise, derive the log-likelihood, and present the analytic solution. They miss the bridge to gradient methods, though.
For gradient methods, Stanford CS229 [1] jumps right into them.
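
For what it's worth, the bridge is short: under Gaussian noise, maximizing the log-likelihood is the same as minimizing squared error, so the analytic solution and gradient descent aim at the same optimum. Here is a minimal sketch of that, assuming NumPy and synthetic data. The variable names, step size, and iteration count are my own illustrative choices, not taken from either set of notes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from y = X w + Gaussian noise (all values are illustrative).
n, d = 200, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=n)

# Analytic route: solve the normal equations X^T X w = X^T y.
w_closed = np.linalg.solve(X.T @ X, X.T @ y)

# Gradient route: descend on the mean squared error, which equals the
# negative Gaussian log-likelihood up to a positive scale and a constant.
w_gd = np.zeros(d)
lr = 0.1  # assumed step size; small enough for this well-conditioned X
for _ in range(2000):
    grad = (2.0 / n) * X.T @ (X @ w_gd - y)
    w_gd -= lr * grad

# Both routes should land on (numerically) the same weights.
print(np.allclose(w_closed, w_gd, atol=1e-6))
```

The closed form is fine at this size; the gradient version is the one that carries over when the data is too large to solve directly or the loss has no closed-form minimizer.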