| This is actually a really good question, and I should expand on this in the guide. The simplicity of the example is deceiving you: in a circuit more complex than just x * y, simply marching along the direction of the gradient in huge steps will not necessarily lead to a good outcome. All we can do is work very locally. You are at some specific point (x,y) = (-2,3). The gradient is telling you that the best direction (in this 2-D space) to move in, if you're interested in increasing your function, is (3, -2), which is the gradient at that point. Now, if you remember, there's a crucial parameter step_size, and the update takes the form (in vector notation): (x,y) += step_size * (dx, dy). All that the gradient is saying is that if step_size is infinitesimally small, your function will increase, and that this direction gives the fastest increase. If you try any other direction with an infinitesimally small step_size, you will do worse. However, there are no guarantees whatsoever about what happens when step_size is larger. In practice we use step sizes of 0.001 or so (small numbers), and it works only because the functions we deal with are relatively smooth and well-behaved. Even so, after we take the small step we have to re-evaluate the gradient right away, because the direction may have shifted. Gradient descent is sometimes compared to walking blind on a hill and trying to get to the top: you can sense the steepness of the hill at your feet (the gradient), and you make steps accordingly. But if you take too large a step and you're not careful, there could actually have been a large drop. TLDR: x * y is a very simple example, and this certainly wouldn't be the case in a more complicated circuit. |
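
To make the step_size point concrete, here is a rough sketch in Python (not code from the guide). The `mul` function is the x * y circuit from the discussion. The `hill` function, `take_step`, and the specific step sizes 0.01 and 1.5 are assumptions picked purely for illustration. It shows a small step along the gradient increasing the function, while a large step along the very same direction overshoots the top of the hill and lands somewhere much lower.

```python
def mul(x, y):
    """The x * y circuit from the discussion."""
    return x * y


def mul_grad(x, y):
    """Analytic gradient of x * y: (df/dx, df/dy) = (y, x)."""
    return (y, x)


def hill(x, y):
    """Hypothetical 'hill' with its top at (0, 0); chosen only for illustration."""
    return -(x ** 2 + y ** 2)


def hill_grad(x, y):
    """Analytic gradient of the hypothetical hill."""
    return (-2.0 * x, -2.0 * y)


def take_step(x, y, grad, step_size):
    """One update of the form (x, y) += step_size * (dx, dy)."""
    dx, dy = grad(x, y)
    return x + step_size * dx, y + step_size * dy


x0, y0 = -2.0, 3.0

# At (-2, 3) the gradient of x * y is (3, -2), as in the text above.
print("gradient of x*y at start:", mul_grad(x0, y0))

# A small step along the gradient increases x * y.
mul_small = take_step(x0, y0, mul_grad, 0.01)
print("x*y before / after small step:", mul(x0, y0), mul(*mul_small))

# On the hill, a small step goes up, but a large step along the
# same gradient direction overshoots the top and ends up lower.
small = take_step(x0, y0, hill_grad, 0.01)
large = take_step(x0, y0, hill_grad, 1.5)
print("hill at start:      ", hill(x0, y0))
print("hill after step 0.01:", hill(*small))
print("hill after step 1.5: ", hill(*large))

# Many small steps, re-evaluating the gradient after every step.
x, y = x0, y0
for _ in range(100):
    x, y = take_step(x, y, hill_grad, 0.01)
print("hill after 100 small steps:", hill(x, y))
```

With step_size = 1.5 the update jumps from (-2, 3) clear across the peak to (4, -6), which is the "large drop" from the hill analogy. The loop at the end is the usual remedy: take a small step, recompute the gradient at the new point, and repeat.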