| These helpful, well-written slides explain where the "gradient" in "Gradient Boosting" comes from: http://www.ccs.neu.edu/home/vip/teach/MLcourse/4_boosting/sl... The gist: when you add a new decision tree that fits the residual error, that tree is fitting the negative gradient of the loss function (i.e. training error) with respect to the current predictions. For squared-error loss the residual is exactly that negative gradient; for other losses the tree is fit to the negative gradient (the "pseudo-residuals") instead. Thus, adding the new tree to your existing ensemble takes a gradient-descent step that seeks to minimize the loss. The "boosting" comes in because the model combines several weak learners (individual trees) into a strong learner (the ensemble of trees).
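To make the "residual = negative gradient" point concrete, here's a rough sketch of my own (not taken from the slides) of gradient boosting for squared-error loss on 1-D data, using one-split trees (stumps). The names `fit_stump`, `boost`, `n_trees` and `learning_rate` are just illustrative choices, and real libraries do considerably more (deeper trees, other losses, regularization).

```python
# Minimal sketch, assuming squared-error loss and 1-D inputs.
# All function and parameter names are hypothetical, chosen for illustration.

def fit_stump(xs, targets):
    """Fit a one-split, piecewise-constant regressor by least squares."""
    best = None
    order = sorted(set(xs))
    for lo, hi in zip(order, order[1:]):
        thr = (lo + hi) / 2.0
        left = [t for x, t in zip(xs, targets) if x <= thr]
        right = [t for x, t in zip(xs, targets) if x > thr]
        lmean = sum(left) / len(left)
        rmean = sum(right) / len(right)
        sse = (sum((t - lmean) ** 2 for t in left)
               + sum((t - rmean) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, thr, lmean, rmean)
    _, thr, lmean, rmean = best
    return lambda x: lmean if x <= thr else rmean


def mse(preds, ys):
    """Mean squared error between predictions and targets."""
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)


def boost(xs, ys, n_trees=50, learning_rate=0.5):
    """Return (predict, history), where history is the training loss per round."""
    base = sum(ys) / len(ys)          # start from a constant model
    trees = []
    preds = [base] * len(xs)
    history = [mse(preds, ys)]
    for _ in range(n_trees):
        # For L = 0.5 * (y - F)^2, the negative gradient w.r.t. F is y - F,
        # i.e. exactly the residual of the current ensemble.
        residuals = [y - p for y, p in zip(ys, preds)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        # Gradient-descent step in function space: move predictions along the tree.
        preds = [p + learning_rate * tree(x) for p, x in zip(preds, xs)]
        history.append(mse(preds, ys))

    def predict(x):
        return base + learning_rate * sum(t(x) for t in trees)

    return predict, history


# Toy data: y = x^2 on a small grid.
xs = [i / 10.0 for i in range(40)]
ys = [x * x for x in xs]
predict, history = boost(xs, ys)
```

Each round fits a stump to the current residuals and adds a shrunken copy of it to the ensemble, so `history` should never go up on the training set. Swapping in a different loss would mean replacing the `residuals` line with that loss's negative gradient.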
Each individual tree breaks up the input space into piecewise-constant regions that best approximate the target function. This representation incurs some error, so a new tree is fit to reduce the remaining error over the entire input space, again by partitioning the input space into piecewise-constant regions, and so on. So it's boosting, but not quite in the traditional AdaBoost sense, where the final model is a linear combination of "dumb" classifiers. Instead, I'd liken it more to a cascade method, where each tree T_{n} seeks to fix the errors left over after the previous tree T_{n-1}:
https://en.wikipedia.org/wiki/Cascading_classifiers There's actually a cool facial landmark detector that uses this same cascading idea to train an extremely fast (and quite accurate) system. In essence, they use a cascade of regression-tree ensembles (trained in a gradient-boosting framework) to detect landmarks. The dlib library has a great implementation, along with a pretrained model. I've used it in my research, and while it's not perfect, I've been satisfied with its results:
http://blog.dlib.net/2014/08/real-time-face-pose-estimation.... http://www.cv-foundation.org/openaccess/content_cvpr_2014/pa... |